Objective Tests 

Their construction 

and analysis
A practical handbook
for teachers


by James Brown 
Longmans 


Objective Tests 


In preparation for this series 


OBJECTIVE TESTS IN CHEMISTRY 
by G. H. Walkley, M.Sc. 


OBJECTIVE TESTS IN ENGLISH (Structure and Lexis) 
by Peter Gibbs, B.A. and M. E. Jolley, B.A. 


OBJECTIVE TESTS IN ENGLISH (Comprehension and Summary) 
by R. W. Noble, B.A. 


OBJECTIVE TESTS IN MATHEMATICS 
by James Brown, B.Sc., 


Other titles will follow 


Objective Tests 
Their construction and analysis 
A practical handbook for teachers 


J. Brown 
B.Sc.(Hons.); M.Ed.; F.I.S.; F.S.S.; A.F.I.M.A. 


Longmans 


Longmans, Green & Co. Ltd., 
48 Grosvenor Street, London, W.1 


Associated companies, branches and representatives 
throughout the world 


© James Brown 1966 


First published 1966 




PRINTED IN GREAT BRITAIN BY 
SPOTTISWOODE, BALLANTYNE AND CO, LTD., LONDON AND COLCHESTER 




Preface 


This small handbook has grown out of the 
author’s teaching and testing experience of the last 
eighteen years, in the United Kingdom, the Sudan 
and West Africa. It owes a lot to former tutors in 
this field. During the last seven years, while the 
author was working for the West African Examina- 
tions Council in Ghana and Nigeria, a great deal of 
basic research into the use of multiple-choice tests 
in various examinations in West Africa was carried 
out and the results are today widely used in selective 
testing and in the G.C.E. ‘O’ level examinations of
the Council. 


There was a need to initiate practical methods of 
training both staff and local teachers in the writing 
of material and in large-scale testing, without which 
the resulting tests could not hope to be worth while. 


Credit for the expansion of this work in West 
Africa must go to Mr. J. Deakin, O.B.E., the second 
Registrar, who realised that without such a test 
programme, the Council would have been over- 
whelmed by the rapid increase in the number of 
candidates for all examinations, itself merely a 




reflection of the expansion of all levels of education 
in West Africa. Equally the first Registrar of the 
Council, Mr. K. Humphreys, pioneered the exten- 
sive use of punched-card equipment and electronic 
calculators in the Council, without which the sheer 
volume of material to be processed could not have 
been handled. 


Originally written to help teachers in West Africa 
to gain some ideas of the methods of construction 
and analysis of objective tests, this book may also 
be of some use in the United Kingdom, particularly 
in Teacher Training Colleges. 


To my former colleagues in the Council may I 
express my thanks for their loyal help in building up 
the testing programme to its present level and 
equally my confidence that they will be not only 
capable of carrying on the programme, but indeed 
will enrich it with their intimate knowledge of the 
educational structure of their countries. 


My gratitude must also go to my wife who has 
typed the manuscript. 


Contents 


Preface 


PART 1. On examinations 

1. The functions of examinations 

2. What kind of examination? 

3. The written examination in essay form 

4. The advantages and limitations of objective tests 


PART 2. Objective test construction techniques 
1. Forms of objective test

2. Types of test functions

3. Constructing a test schedule

4. The training of item-writers

5. Item writing

6. The review of written items

7. Trial testing

8. The synthesis of a final paper

9. Validation of a test


PART 3. Specimen items illustrated and discussed 
Mathematics, Biology, Geography, English Language, Foreign Languages, Science 
PART 4. Elementary item analysis for use in class-room tests 
1. The average score 
2. Evaluating each item separately: 
(a) Facility, (b) Discrimination 


PART 5. Compiling a test for class-room use 


PART 6. Statistics needed for analysis of objective tests 
1. Notation 

2. The average or mean mark in a test 
3. The standard deviation of a mark distribution 
4. Skewness, quartiles and percentile scores 
5. Accuracy of working 
6. Item analysis 

7. Reliability of a test 

8. The standard error of a test score 
9. Correlation between test scores 


Selected Bibliography 
Appendix 1. Specimen answer cards for multiple choice tests 
Appendix 2. Statistical tables 
Table 1. Double tetrachoric correlation coefficients 
Table 2. Values of √(1 − r), where r = reliability of an objective test
Table 3. The normal curve in percentile areas; and ordinates 




Part 1. On Examinations 


1. The functions of examinations 


Any examination is an attempt to measure some 
ability, knowledge or personality characteristic of 
the examinees. This implies firstly that the measure 
is possible, and secondly, that the order of merit 
obtained is valid. It has for long been the traditional 
assumption that examinations always will exist and 
that they therefore have to be passed. The most usual 
reaction to this view is to try and abolish examina- 
tions. The main argument used by abolitionists is 
that they largely fail to measure the specific ability 
or knowledge intended and that the resulting order 
of merit is therefore spurious. Support for their view 
is encouraged by the devastating evidence produced 
by two classic studies: ‘An Examination of Examina- 
tions’ by Hartog and Rhodes (17) and Valentine's 
‘The Reliability of Examinations’ (19). The main 
theme of each is that the same piece of written work 
can lead to widely different assessments, not only by 
different examiners, but by the same examiner on 
different occasions. It is, of course, the traditional 
form of examination which is criticised in these 
works, namely, the form requiring longish essay- 
type answers in response to a small number of 
questions selected from an extensive syllabus. 

Of course a badly-set examination paper, of 
whatever type, cannot result in any valid measure. 
An examination is a bad one if, despite the apparent 
fairness of the questions, there is a wide divergence 
of opinion between examiners on the mark to be 
awarded to any given set of answers. A good 
examination, therefore, needs to satisfy three 
criteria: (i) it must be neither too demanding, nor too
easy, for the level of the candidates; (ii) it must 
measure, and be proved to measure, the specific 
ability or knowledge it is intended to measure; 
(iii) it must be capable of producing answers which 
can be accurately marked, so that any examiner will 
award the same mark to the same piece of work. 

No traditional examination can hope to achieve
perfection on all three criteria, although a really
expertly-set paper can certainly come very close. 
The two reports mentioned above served the very 
valuable purpose of provoking research into 
methods of marking which would reduce the 
variations between examiners to the lowest possible 
level. With an experienced team of examiners and 
the systematic re-marking of scripts by at least two 
different examiners, it is possible to come very close 
to an agreed mark on each script. By a study of the 
results from year to year on papers set on the same 
syllabus and taken, it is assumed, by candidates of a 
similar range of ability, it is possible to ensure that 
the difficulty level of successive papers is approxi- 
mately constant and that if any difference occurs, it 
can be allowed for in the marking. Hence the first 
and third criteria can be reasonably satisfied. The 
second, that the measure shall be of the specific 
ability or knowledge intended, is far too often taken 
for granted. After all, it is said, what else does a
History paper measure except knowledge of
History, or any other examination measure except
competence in the subject on which questions have
been set?

The clearest additional ability being tested in most 
examinations is the ability to express the answers in 
clear, competent language. If a candidate who has a 
good knowledge of his subject is unable to com- 
municate this knowledge accurately, because of his 
inability to construct suitable sentences quickly 
enough, he is likely to receive far fewer marks than 
his real knowledge justifies. It is not the fault of the 
examiner—he cannot award marks for material not 
presented. Yet it does illustrate the inevitable 
handicap suffered in most subjects by a candidate 
whose prose construction is poor. Now that 
English Language is no longer an essential require- 
ment for G.C.E. purposes, there is some risk that 
standards in English will fall and this handicap 
become even more serious. Even in mathematical 
subjects, where the need to write competent English
is less, there is nevertheless a need to compose
short, explanatory introductions to mathematical 
statements. Examiners often report on the poor 
presentation of mathematical work and the lack of 
necessary explanations, for which mark penalties 
are imposed. There are less obvious and sometimes 
accidental additions to the specific knowledge under 
test. It occurs from time to time that a candidate 
offering both History and Geography finds that he 
can answer a question particularly well because his 
knowledge of the other subject supplements his 
material. Similar connections often occur between 
Physics and Chemistry and, less often, between 
Chemistry and Biology. It is, of course, very desir- 
able educationally to integrate subjects in this way 
whenever possible, yet since there is no compulsion 
to group subjects in this way, one group of can- 
didates may quite fortuitously have an advantage 
over another group who do not happen to choose 
the same subject combinations. The remedy is to 
confine questions so strictly to each separate syllabus 
that such overlap is impossible. Yet, to provide 
sufficient material in each question, essay-questions 
have to be reasonably broad, with the chance of 
overlap increased. 

It may be questioned whether accuracy of marking 
is essential. If, for example, the purpose of the 
examination is to decide between a pass and a 
failure, perhaps 80% of the scripts can be readily 
decided upon. Yet the remaining 20% will be 
border-line cases where there has to be some decision 
on an order of merit to determine the cutting line. 
The more categories of award there are, the greater 
the need for such orders of merit. The introduction 
of marks simply allows scripts to be readily com- 
pared and also assists in assessing the comparative 
standards adopted by different examiners. The total 
number of marks for a paper is immaterial, provided 
there are sufficient to indicate the separations between
candidates’ work needed to classify them. 

An examination can have several kinds of function
as its final objective. In educational work it can be 
used for three things: to measure attainment, for 
diagnosis, and for the prediction of a candidate’s 
probable future performance on some course of 
study. 

The first is the simplest and commonest; the 
measure of the attainment of a candidate after 
completing a prescribed course of study. Diagnostic 
or predictive examinations are often considered 




together, but they are essentially different. A 
diagnostic examination attempts to answer the 
question: ‘For what kind of further education, if any,
is the candidate most suitable?’ A predictive examination
has a specific course of study in mind, and
attempts to answer the question ‘How well will the
candidate respond to this particular course of study?’

It is perhaps unfortunate that one examination is
used, often wrongly, to answer all three questions at 
once. The G.C.E. Ordinary Level examinations are 
intended as attainment tests at the conclusion of four 
or five years’ secondary education. The results are
frequently used to select sixth form candidates; both 
diagnostically, in choosing their most profitable 
subjects of study, and predictively, of their eventual 
G.C.E. ‘advanced’ level success in the selected 
subjects. Such correlation studies as have been
carried out, some by the writer, show such poor 
results that it is clear that the use of the earlier 
examination as a predictor of ‘A’ level success is 
unjustified. There are clearly other criteria needed. 

At lower levels, selection from primary schools for 
secondary school entrance is often carried out by 
means of a ‘Common Entrance Examination’ which 
attempts a purely predictive measure, since the 
general content of the secondary school course is 
already determined. In the United Kingdom, the 
transfer examination, often wrongly called the
‘11-plus’ examination, contains a strong diagnostic
element, since the form of secondary education of
the candidate is decided largely by it within the
tripartite pattern of grammar, technical and modern
schools established by the Butler Act of 1944. Although
the transfer examination is successful in the majority
of cases, the failure to equate standards in the
‘modern’ schools has led to the concept of ‘passing’
the examinations, which means in practice being
accepted into a secondary grammar-school. The
introduction of comprehensive schools attempts to
overcome the prejudice against this examination,
which is often based on ignorance of its main
objective. The selection is still carried out almost
inevitably inside the comprehensive school, since
children are not of equal ability, nor have they
identical interests. This reaction to the form of an 
examination is vividly illustrated by Miss H. Lister 
in her contribution: ‘The Effects of External 
Examinations on the School’ (External Examina- 
tions in Secondary Schools, edited by G. B. Jeffrey: 
Harrap, 1958 (18)) in which she says: 




‘The primary function of any examination, 
internal or external, is to summarise and test the 
work done up to a given point, not to determine the 
work that shall be done. The diagnostic function is 
another thing altogether, full of danger unless 
carefully kept apart from the process of education. 
It is only too well known, for example, what a 
widespread influence has been exerted on primary 
teaching by the ‘11-plus’ transfer examination, so 
carefully designed as an objective test of ability and 
promise, and so serious—sometimes disastrous—on 
the years of teaching that precede it. It takes 
teachers with an extraordinarily firm grasp of their 
principles to avoid preparing pupils for the kind of 
test they are to undergo and to be judged by; and 
that holds good at the grammar-school stage as
well.’

It is clear that the essential function of each
examination should be defined, and its results
limited to that function.


2. What kind of examination?


An interview by a headmaster or a group of teachers 
is an examination. As a result of the interview each 
candidate is graded: usually he is selected or rejected 
by the headmaster of the school to which he has been 
seeking admission. There are usually other tests as 
well, which lead to a smaller group of candidates 
being chosen for interview. Does the interview, 
however, add anything to what we have learned from 
other measures? One possible thing may be an 
assessment of the personality of the candidate, 
although this is difficult in a short interview. It 
would be as well to remind ourselves once more of 
the evidence in the Hartog and Rhodes’ report, in 
which they described the completely different 
assessments of the same civil service applicants made 
by two different, but equally experienced panels of 
interviewers. 

A second assessment is frequently obtained by 
using the teachers’ estimates of their candidates’
abilities. Clearly if such assessments are based on
several years’ teaching and observation, they are of 
considerable value. The difficulty is usually in com- 
paring estimates from different schools. At present, 
estimates are used only in doubtful cases where the 
candidate’s performance is erratic or sickness has 
prevented the full papers being taken, and then 




always to the candidate's advantage. As this is 
generally known, headmasters naturally tend to be 
even more optimistic than they usually are. Thus the 
reliability of the estimates falls. One way out is to 
impose a scale on the schools so as to induce, in 
time, a standard which would be uniform for all 
schools. At least by comparing scaled estimates year 
by year, with actual performance, the reliability of 
each school could be assessed and their estimates 
modified accordingly. 


3. The written examination in essay form


The essay form of paper, almost exclusively used at 
the G.C.E. Ordinary level and upwards in the 
United Kingdom, has been largely replaced in the 
United States of America by one or more forms of 
objective test. 

There is nothing inherently good about either 
form of examination; each must prove its value. For 
many years it was taken as axiomatic that the only 
way to test a candidate's knowledge of a subject was 
to set him the task of writing one or more essays, 
‘to give him a chance to show what he has learned’. 
It was also earlier taken for granted that the main 
task of education was to impart facts, and that by 
some mysterious alchemy the human brain would 
learn to reason with and from these facts, linking 
‘subject’ to ‘subject’ to achieve the integrated whole 
of an educated man. There were thus theories which 
stated that the purpose of learning by memory
Euclidean geometrical theorems, or irregular verbs, 
or whole quotations from the classics, was to teach 
one to reason logically. Individuals vary a great deal; 
a few may achieve this integration on such an 
educational diet, the many do not. Perhaps most 
people of average intelligence, fed on facts alone, 
can, through their everyday experiences, link up 
facts and deduce simple propositions from them. In 
general, however, there is no real proof that such 
things follow automatically from the mode of 
teaching underlying this type of examination. 

One of the major problems imposed by papers in 
the traditional form is that the choice of relatively 
few questions from a wide syllabus introduces a 
sizable element of luck for the candidates. Indeed, 
in many subjects, popular topics recur, and this 
allows experienced teachers to estimate with some 




accuracy the chance of certain topics appearing, and 
thus they can coach their candidates along certain 
lines. If their ‘question-spotting’ succeeds, such 
candidates receive disproportionately high marks in 
relation to their real knowledge of the full syllabus. 
In general, such papers allow a choice of questions. 
This is primarily intended to overcome the element 
of luck in the selection of questions, and if a very 
wide choice is allowed, it can be fairly said that all 
candidates have had an equal chance to choose their 
preferred topics. On the other hand, this wide 
diversity of topics introduces a further problem. 
Inevitably some questions will be attempted by only 
a few candidates, others by almost all of them. It 
would take a very experienced examiner indeed to 
assess accurately, in relation to the marking guide, 
the value of an answer, of which he may see only a 
single example, in comparison with those of which 
he may see hundreds. An example of the variability 
of choice came to light in the statistical data on an 
English Language paper which the writer analysed 
during the preparation of the report in ‘English 
Language Examining’ (by W. D. Grieve) for the 
West African Examinations Council (16). In this, 
out of nine essay topics, in a sample of 1025 can- 
didates, three of the essays were selected by four, six 
and eight candidates respectively, and a fourth topic, 
by only 18 candidates. The remaining five
topics were chosen by the remaining 989 candidates
in approximately equal numbers. It is easy to
say, after the event, that only these five topics need 
have been set, yet there was no apparent reason why 
the other four had been largely avoided. 

The second problem introduced is inescapable. 
What exactly does an essay measure? In an English 
Language paper there is at least a full answer: the 
ability to present an accurately written piece of 
English prose, based on a stated theme. The marking
is a matter for a skilled team of well-trained 
examiners, and with a continuous control of 
standards by a chief examiner, the results are 
sufficiently accurate to be largely acceptable to the 
teaching profession and the general public. Yet some 
of the criticisms contained in the above-mentioned 
report by Grieve of the paper as it is at present, show 
that such accuracy is exceedingly difficult to achieve, 
and it is his conclusion that the present paper does not in
fact measure the ability to use English as a
spoken, or in general, as a written medium of
communication. There is, he argues, a kind of
‘examination English’ artificially engendered in the
schools by the nature of the final paper. Steps are 
being taken to modify the paper, including the 
introduction of an objective test, so that this 
criticism can be met. The problem of accurate 
marking remains. 

In other subjects, the essay-answer is less defen- 
sible. Here an essay seems to involve a varying 
combination of factual knowledge, with some 
interpretation of facts, and less often, some in- 
ferences from facts, with an exercise in prose con- 
struction in the English language. As has already 
been stated, badly written, badly expressed ideas can 
vitiate the value of the material offered. Were the
candidate permitted to list his facts and arguments 
as a sequence of numbered sentences—or perhaps in 
note form—thus minimising his need to use more 
extended forms of English, he might well be able to 
reveal in the time allowed a much greater knowledge 
of the subject. There is therefore inevitably a double 
standard used in the marking of other subjects in 
essay form. Although there may be no specific 
allowance or penalty for the quality of English, no 
examiner can assess knowledge which the candidate 
possesses, but has failed to present adequately. 
Significantly, essay marks in all subjects rarely 
achieve the maximum, and equally rarely are zero. 
No essay is ever assessed as perfect, yet most can- 
didates can, however weak, write some sort of 
garbled account, which an examiner searches 
through hopefully, and can find enough for a few 
marks. As a result, the spread of marks is low in 
relation to the full mark total and consequently the 
reliability coefficient associated with the mark 
distribution is also low. (In Part 6, the statistical 
formulae are presented which amplify these technical
terms. For the present it suffices to say that, in
mark ranges typical of essay answers, the standard 
error of a mark is high so that in practice very few 
effective separations between the various qualities 
of candidates are possible.) 
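
The link between spread of marks, reliability and the standard error of a mark can be illustrated numerically. The sketch below assumes the conventional formula, standard error = s × √(1 − r), where s is the standard deviation of the marks and r the reliability coefficient; the formulae actually used in this book are those of Part 6. The two sets of figures are invented purely for illustration, and the ratio of spread to standard error is offered only as a rough indication of how many separations a paper can support.

```python
import math


def standard_error(sd: float, reliability: float) -> float:
    """Conventional standard error of a score: sd * sqrt(1 - r)."""
    if not 0.0 <= reliability <= 1.0:
        raise ValueError("reliability must lie between 0 and 1")
    return sd * math.sqrt(1.0 - reliability)


# Invented figures, for illustration only (not taken from any real paper).
papers = {
    "essay paper": {"sd": 8.0, "reliability": 0.60},
    "objective test": {"sd": 15.0, "reliability": 0.90},
}

for name, p in papers.items():
    se = standard_error(p["sd"], p["reliability"])
    # A larger spread-to-error ratio means the marks can support
    # more genuine separations between candidates.
    print(f"{name}: standard error = {se:.2f} marks, "
          f"spread / standard error = {p['sd'] / se:.2f}")
```

With these assumed figures the two standard errors are of similar size, but the narrow spread of the essay marks leaves far less room for effective separations.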

The final problem of essay answers is again
involved in the marking procedures. It takes many
weeks to mark all of the scripts from a large examination
and this can proceed only after time-consuming
meetings of examiners at which the Chief Examiner,
who has set the paper and prepared a draft marking
scheme, co-ordinates the standards of his team of
examiners by discussing some selected scripts, and 
by watching their efforts on a few new scripts. Only
when he is satisfied that they have grasped the
standards laid down, does he allow them to continue 
marking their allotted scripts. During this time, 
often two or more days in duration, the mark scheme 
is amended, often quite drastically in the light of the 
problems presented by the actual scripts submitted. 
Even an experienced Chief Examiner cannot readily 
predict in advance of the examination just how the 
candidates will respond. Hence there is always the 
need to amend a mark scheme. A further delay is 
needed, since usually a selection of the scripts from 
each examiner is sent to a moderator who controls 
the overall standards, and until he agrees that the 
work is adequate, the examiner still cannot proceed. 
Despite this elaborate and expensive procedure,
however, it is not uncommon to find at least one 
examiner who has deviated so far from his initial 
standards that all of his scripts need scaling, or 
even re-marking. All of these devices are needed to 
ensure the most accurate marking possible, so that 
each candidate can receive a just assessment of his 
work. Even so, occasional blunders can occur, and 
unmarked questions, or wrongly marked material, 
may only come to light if the script happens to fall 
under the eye of the senior examiners and Chief 
Examiner at the final awarding stage, where border- 
line cases are selected and carefully reviewed yet 
again. The problem of finding teachers, able and 
willing to be trained as examiners, is always present. 
The problem is heavy enough in the United King- 
dom, where at present nine examining boards 
compete for the teams they need. The introduction 
of many C.S.E. boards has added to the problem, 
but has been overcome by using large numbers of 
teachers to mark their own work. 


4. The advantages and limitations of objective tests

There are four basic advantages, which can be listed as

(a) specific content and coverage
(b) precise problem posed
(c) rapid and accurate marking
(d) purity of content


(a) Specific content and coverage 
Each question in an objective test is usually known 
technically as ‘an item’. Each item covers a specific
point of the study under test. In Part 2, where test
construction is discussed in detail, it will be seen that 
part of the technique of preparing a test is a detailed 
analysis of the examination syllabus. Items are then 
carefully written, revised, and rewritten until the 
specific point in the syllabus to be tested by this item 
is covered. In this way, a large number of items can 
be compiled, each covering a specific feature of the 
syllabus, not only its factual content, but the depth 
of reasoning appropriate to the level of the can- 
didate’s standard. These items assembled as a test 
thus cover the whole syllabus, the relative impor- 
tance of each section being weighted by a propor- 
tionate increase in the number of items devoted to it. 
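
The weighting of sections by number of items can be sketched as a simple proportional allocation. The section names, the weights and the total of 60 items below are hypothetical, and the largest-remainder rounding is merely one convenient way of making the whole-number counts add up to the intended total; it is not a procedure prescribed in this book.

```python
def allocate_items(weights: dict[str, float], total_items: int) -> dict[str, int]:
    """Share total_items among syllabus sections in proportion to their weights."""
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    exact = {s: total_items * w / total_weight for s, w in weights.items()}
    counts = {s: int(x) for s, x in exact.items()}  # whole items first
    leftover = total_items - sum(counts.values())
    # Hand any remaining items to the sections with the largest fractional parts.
    by_fraction = sorted(exact, key=lambda s: exact[s] - counts[s], reverse=True)
    for section in by_fraction[:leftover]:
        counts[section] += 1
    return counts


# Hypothetical syllabus: relative importance of each section (any scale).
syllabus = {
    "Arithmetic": 30,
    "Algebra": 25,
    "Geometry": 25,
    "Trigonometry": 10,
    "Statistics": 10,
}

for section, n in allocate_items(syllabus, 60).items():
    print(f"{section}: {n} items")
```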


(b) Precise problem posed 


An essay-type question, to provide some flexibility 
for the candidate, must be sufficiently broad to give a
sufficient basis for a response of adequate length. 
Equally, a virtually identical question can be set at 
different levels, the details expected in the answers 
increasing as more advanced levels are reached. The 
average candidate often finds difficulty in assessing 
the detail required and is thus inclined to include far 
too much, in case some extra fragment earns an 
extra mark or two. Significantly, as examiners aim 
at shorter, more precise answers, the content of each 
question shrinks and their number increases. The 
short-answer questions in one section of the Science 
papers at the G.C.E. Ordinary level illustrate the 
final point of this trend. Such questions are so nearly 
objective in form that there would be a great gain in 
efficiency and no loss in their value if the final step 
to a fully objective form were taken. 

The essence of an objective test item is that it asks 
a single, precise question to which there is a uniquely 
correct answer. It is quite possible to write a group 
of five or more items on the same theme, each 
exploring a different facet. By his answers, the 
candidate’s grasp of this theme can be accurately 
assessed. This grasp can, by choice of items, cover 
his factual knowledge of the theme, his comprehen- 
sion of the principles involved, and his ability to 
deduce from these principles. Since the range of 
items covers a set of precise points of study, every 
topic in the syllabus occurs. There is an end to 
‘question-spotting’ since all topics will appear and
the huge diversity of possible items precludes the 
chance of guessing actual items. 


(c) Rapid and accurate marking 
Clearly the construction of a suitable test of 50 to 
100 items is a long and technically complex task. It 
is a task undertaken well before the examination, 
but certainly takes not much longer than the present 
procedure of setting a written examination paper, 
where discussions by correspondence between the 
various people involved can take a full year in 
advance of the printing order being given. The 
reward for the difficult task of constructing a test,
however, comes in the marking. Suitably arranged, 
and with special equipment available, the marking 
can be dealt with by machine at a very high speed. 
There is nothing inhuman or unfair about this: 
examination marks should be quite impersonal. 
Most examiners are aware of their sympathetic 
reaction to a struggling but clearly weak candidate, 
and of the refreshing change and corresponding 
tendency to overmark when an excellent script 
appears in a group of mediocre ones. This human 
failing—and it is a failing in an examiner—leads to 
the usual ‘benefit of the doubt’ decisions. The very 
existence of this doubt reflects the zone of error 
implicit in all essay marking. On the other hand, 
each item in an objective test is either right or wrong. 
The usual procedure is to award one mark for each 
correct item, but more complex procedures are 
possible to deal with special problems. Most 
methods of scoring can be marked by a variety of 
techniques. Even if no special equipment is available, 
tests can still be marked rapidly by hand. Either the 
answers can be arranged in a column on the side of 
the question paper, or a special answer sheet can be 
used. Sometimes a cut-out grille is placed over the 
paper, through which the candidate’s answers can 
be seen in the spaces allowed for them. The labour of 
marking is not excessive—usually a correct answer 
is ticked, and wrong ones ignored. No specialist 
knowledge is needed, thus conserving the trained 
examiners for essay marking, and competent people 
of clerical grades can mark such tests quite readily. 
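
The marking procedure described above amounts to comparing each answer with a key and counting the agreements, one mark per correct item. A minimal sketch follows; the key, the candidates' answer strings and the use of '-' for an omitted item are all invented for illustration and do not reproduce any actual answer card.

```python
def score_sheet(key: str, answers: str) -> int:
    """Award one mark for each answer that agrees with the key.

    Wrong answers and omissions (shown as '-') are simply ignored.
    """
    if len(key) != len(answers):
        raise ValueError("answer sheet and key must have the same number of items")
    return sum(1 for correct, given in zip(key, answers) if given == correct)


KEY = "BDACEABCDA"  # hypothetical key for a ten-item test

sheets = {
    "candidate 001": "BDACEABCDA",
    "candidate 002": "BDCCE-BCAA",
    "candidate 003": "A-ACE-B-D-",
}

for candidate, answers in sheets.items():
    print(candidate, score_sheet(KEY, answers))
```

Since the comparison needs no knowledge of the subject, the same routine serves for any test once its key is fixed.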
One implication of this form of marking is that 
the score accurately measures the candidate’s 
knowledge. If two candidates of equal ability have 
each covered 60% of the syllabus, they are very 
likely to score equal marks, although their scores will
be built up from different questions. This is simply
because as the full syllabus is covered, each will find 
60% of the items within his coverage of study. 
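
This argument can be made concrete with a small worked case. The sketch below assumes a hypothetical syllabus of 50 distinct points, two candidates who have each studied a different 60% of it, and the simplification that a candidate answers correctly every item on a point he has studied and no other. The five-question essay paper is included only for contrast and is likewise invented.

```python
SYLLABUS_POINTS = range(50)  # hypothetical syllabus of 50 distinct points

# Each candidate has studied a different 60% of the syllabus.
knows_a = set(range(0, 30))
knows_b = set(range(20, 50))


def score(known: set[int], items: list[int]) -> int:
    """Count the items whose syllabus point the candidate has studied."""
    return sum(1 for point in items if point in known)


# Objective test: two items on every point, so the whole syllabus is covered.
objective_test = [point for point in SYLLABUS_POINTS for _ in range(2)]

# Essay paper: five questions, each resting on a single point of the syllabus.
essay_paper = [3, 11, 17, 24, 28]

print("objective test:", score(knows_a, objective_test), score(knows_b, objective_test))
print("essay paper:   ", score(knows_a, essay_paper), score(knows_b, essay_paper))
```

Under these assumptions both candidates score 60 out of 100 on the objective test, from different items, while on the five-question paper the first can attempt all five questions and the second only two.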
A regular complaint about objective tests,
especially of the multiple-choice pattern, is that
candidates can guess when they do not know a 
result, and that this vitiates the score. There are two 
answers to this: firstly, if necessary, a correction can 
be made to the scores to allow for guessing. 
Admittedly, there is no exact way of allowing for 
this in individual candidates, but the real problem 
of guessing is not in that aspect. The essential 
difference is between a group who have been advised 
not to guess because a penalty for wrong answers 
may be imposed, and another group encouraged to 
complete unknown items by guessing. It is this type 
of guessing that can be evaluated and corrected. 
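
One widely used correction of this kind, given here as an assumption since no formula is stated at this point, deducts from the number of right answers the number of wrong answers divided by one less than the number of choices per item; omitted items are neither rewarded nor penalised. The figures in the sketch are invented: both candidates are supposed to know 60 items out of 100, but one guesses the remaining 40 (with average luck on five-choice items) and the other omits them.

```python
def corrected_score(right: int, wrong: int, options: int) -> float:
    """Conventional guessing correction: right - wrong / (options - 1).

    Omitted items do not appear in the formula at all.
    """
    if options < 2:
        raise ValueError("each item needs at least two options")
    return right - wrong / (options - 1)


OPTIONS = 5  # hypothetical five-choice items

# Candidate who guesses the 40 unknown items: about 8 lucky hits, 32 misses.
guesser = corrected_score(right=60 + 8, wrong=32, options=OPTIONS)

# Candidate who omits the 40 unknown items.
omitter = corrected_score(right=60, wrong=0, options=OPTIONS)

print(guesser, omitter)
```

On these assumed figures the correction brings the two candidates to the same score, which is the group effect the correction is meant to remove; it cannot, of course, allow for one individual's unusually good or bad luck.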
The second answer to guessing is that the small 
general inflation of scores by guessing is unimpor- 
tant; it merely alters the level at which decisions are 
made on the results. If the pass-score is held well up, 
and there are statistical criteria available for 
deciding this, the chance of a candidate passing by 
guessing is very low, much lower by far than such a 
chance at present offered by the normal error level 
in essay marking. 
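
The smallness of this chance can be checked with the binomial distribution. The sketch below assumes the extreme case of a candidate who guesses blindly on every item, with each of the options equally likely to be chosen; the test length of 100 items, the five options and the pass-score of 40 are hypothetical.

```python
from math import comb


def chance_of_reaching(pass_score: int, items: int, options: int) -> float:
    """Probability of at least pass_score correct answers by blind guessing."""
    p = 1 / options  # chance of guessing any one item correctly
    return sum(
        comb(items, k) * p**k * (1 - p) ** (items - k)
        for k in range(pass_score, items + 1)
    )


ITEMS, OPTIONS, PASS_SCORE = 100, 5, 40  # hypothetical test

print("expected score from guessing alone:", ITEMS / OPTIONS)
print("chance of reaching the pass-score:",
      chance_of_reaching(PASS_SCORE, ITEMS, OPTIONS))
```

Guessing alone yields about 20 marks on average in this case, and the chance of reaching a pass-score set at twice that level is negligible.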


(d) Purity of content 
This follows from the precise wording of items. 
By using the simplest wording in stating each
problem, there is a negligible chance of even a weak 
candidate being unable to understand the item even 
if he cannot solve it. Further, by narrowing the 
content of each item, there is no overlap problem; 
that is, there is no chance of the lucky knowledge of 
another subject accidentally helping in the solution 
of an item. This essential content purity is deliber- 
ately sought in the test at the construction stage, for 
without it, the accuracy of the resulting score is 
reduced, and ultimately the validity of the test is also 
reduced. A test which attempts to measure several 
disciplines simultaneously is a bad test. Of course, 
for ease of administration two or more tests may be 
included in the same booklet, but each is distinct, 
separately timed, and separately scored. This is
quite different from the general muddle of items in
some of the so-called ‘intelligence tests’ and ‘aptitude
tests’ which, quite rightly, have called forth
condemnation from teachers. Do not, however, judge
all tests by these badly designed ones, produced 
commercially more for parents than for teachers, 
many in the United Kingdom being alleged to assist 
in ‘coaching’ for the ‘11-plus’ transfer examination. 
Such tests do a great disservice to the genuine
research of national bodies, aimed at more exact
and, hence, more just examinations. 


(e) Limitations of objective tests 


It is never seriously claimed that objective tests are 
the only answer, or always the best answer, to the 
problem of perfect examining. The most valid 
criticism concerns the lack of any requirement for 
essay writing. This is essentially a skill, and tests 
have been designed to measure this skill, without 
calling for an actual essay to be written. Such tests, 
however, should be regarded as research exercises, 
proving only that essay writing is a measurable skill. 
Nevertheless, objective tests do not provide a 
measure of ordering and selecting material, except 
by the use of highly sophisticated grading which 
could only be used with advanced candidates. 

All examinations are bound to affect the school 
curriculum. This is inevitable and need not be a 
disaster. Indeed, skilfully set examinations can be a 
service in a community where many of the teachers 
lack high qualifications. The guidance they provide
as interpretations of a syllabus at least gives such
teachers a target to aim at. This ‘feedback’ into the schools
would, however, be quite disastrous if the adoption 
of objective test examinations led to mere coaching 
for this sort of test, replacing conventional teaching. 
It should not be overlooked, however, that the
existence at present of essay-type papers has also
caused undue emphasis to be placed in many schools
on the regular weekly essay in each of the main 
subjects. Many such essays submitted are merely 
rewrites of standard text-book passages and there 
is good reason to doubt the value of them. However, 
traditions die hard and it is certainly true that it is 
absolutely essential to retain some examination 
papers in the essay form, coupled with an objective 
test, to ensure that practice in prose construction and 
general essay writing in the schools is continued, 
voluntarily or otherwise. Since ‘feedback’ exists and 
will always exist, whatever examinations are 
proposed, it is the duty of any examining body to 
ensure that such ‘feedback’ produces positive and 
useful effects in the schools. The fundamental truth
is that, given sound teaching, the final form of the
examination presents no real problem to the gifted
student. The student on the border-line is, however,
more justly served by a paper whose marking is 
accurate. 

It should be emphasised therefore that where an 
objective test cannot measure adequately a skill or 
some acquired abilities, such as for example, the 
technique of arranging data and evaluating material 
in an ordered way, the conventional type of paper 
must be retained. No examining body, however
much of its effort is devoted to objective testing, has
proposed, or would ever be likely to propose, the
complete abolition of written papers.


Part 2. Objective Test Construction Techniques 


1. Forms of objective tests 

The most vital criterion that every test item must 
satisfy is that there must only be a single correct 
answer. There can be no relaxation of this rule and 
all subject specialists must agree that there is this 
answer, and only this answer. Otherwise the 
marker’s assessment of the quality of an answer 
enters in, and this destroys the objectivity of the 
item. 


(a) Open-ended or ‘supply’ type of item
Basically, there are two types of item. The first is the 
open-ended, or the ‘supply’ type of item. This 
requires a candidate to study a specific question and 
write down his answer in a prescribed place. Usually
the answer is a single word or figure, but occasionally 
a slightly longer response is required. In such cases 
the number of words needed is often indicated. In 
this type of item it is usual to arrange the spaces for 
the answers at the end of a line, or in prepared boxes,
so that in marking it is easy to find the answers. 
Often a cut-out grille is designed, which, placed over 
the printed test paper, shows the candidate’s 
answers above or below the correct response 
printed on the grille. The papers, of course, need to 
be marked by hand and are suitable for class-room 
use, or in examinations where the number of 
candidates is limited.

There are, however, objections to their use.
However hard an examiner tries to cover all possi- 
bilities, it is not uncommon to find more than one 
acceptable answer. In such cases, a decision is 
needed on whether the item is marked as correct, and 
this implies a skilled marking team able to make such 
decisions. This, in any major examination, uses up 
the supply of trained examiners, who are more 
usefully employed on the far more delicate task of 
marking essay papers. In class-room use, however, 
where a teacher can make his own judgements, and 
later discuss the items with the class, they are of
considerable use.


The kind of problem presented can be illustrated 
first of all by two very bad questions: 


‘What is the value of π?’ Does the examiner
expect 3, 22/7, 3·14, 3·142 . . . ? There is no exact
answer.


‘Gold was first discovered in...’ Is a country, 
or a year, required? 


A more involved case illustrates the problem of 
plausible answers and the need for decisions on other 
answers: 


‘What insect spreads Malaria?’ Suppose that 
both ‘Mosquito’ and ‘Anopheles’ are agreed as 
correct. How far wrong can the spelling be, if 
at all, before the answer is called wrong? Does 
‘moskeeto’ or ‘onoflees’ show sufficient knowledge,
but poor English?


The supply type is certainly useful in Elementary 
Mathematics, and in the calculation or formulae 
type of question in Physics and Chemistry, where 
simple problems are assessed entirely on the correct 
answer. One of the common techniques in testing this 
type of item is to use an open-ended test, and 
examine the distribution of wrong answers, which 
are often due to common errors of method. These 
wrong answers can then be used as distractors in the 
multiple-choice type of item. 
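
The tallying of wrong answers described above can be sketched as follows. The trial responses are invented for illustration and the function name is hypothetical; the point is only that the most frequent wrong answers from an open-ended trial become the candidate distractors.

```python
from collections import Counter


def candidate_distractors(responses, correct, how_many=4):
    """Return the most frequent wrong answers from an open-ended trial,
    most common first, for use as multiple-choice distractors."""
    wrong = Counter(r for r in responses if r != correct)
    return [answer for answer, _ in wrong.most_common(how_many)]


# Invented trial answers to 'What is the sum of 210 and 169?'
trial = ["379", "369", "379", "41", "279", "369", "379",
         "41", "369", "269", "279", "369", "41"]

print(candidate_distractors(trial, correct="379"))
```

A wrong answer that hardly any trial candidate produces is unlikely to attract anyone as a distractor, so the frequency count is also a first check on plausibility.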


(b) A comparison of ‘open-ended’ and multiple- 
choice items 


There is often quite a serious argument offered that 
a supply-item is preferable to a multiple-choice 
item, since in the former the candidate has to recall 
some material and provide his own reply, whereas in 
the latter, since he knows that the correct answer 
lies before him, he may be sufficiently reminded by 
its presence to answer correctly, and otherwise be 
induced to guess. It is also sometimes asserted that 
it is educationally bad for a candidate to see before
him wrong answers for, having chosen one of them,
this wrong answer may be fixed in his mind as a true 
fact. Since these arguments are put forward in good 
faith, they deserve close discussion. 

First of all, the requirement of recall from the
candidate is correct, but the implication is illusory.
The need to make supply-items absolutely foolproof,
so that there can be only one answer, severely limits
the scope of such items in a large-scale examination, 
so that the recall required is so basic that it is difficult 
for them to provide a test of higher skills or indeed 
go much beyond purely factual levels. There is also, 
as mentioned above, the wastage of trained examin- 
ers in the marking of these items. Secondly, the 
presence of the correct answer among others in a 
multiple-choice item does not necessarily induce the 
candidate to guess—this has been discussed in 
Part 1—and, particularly for weak candidates, all 
of the answers will, in a good item, look equally 
plausible. Definite knowledge is needed to select the 
correct answer. Thirdly, the presentation of wrong 
material in this form during a relatively short period 
of examination does not allow any real time for the 
items to be memorised. Indeed tests have shown that 
immediately after completing a test of some 50 
items, not a single candidate from a group of 75, 
including some very good students, could write 
down any single item from memory, let alone recall 
any set of five responses to an item. The nearest any 
of them came was in the case of two of the best boys 
who successfully paraphrased up to five items to 
which they could recall their own correct answers, 
but could only recall either one or two of the wrong 
answers. The poorer candidates, although most 
wrote down something on their questionnaire, 
could remember neither questions nor answers 
accurately following the test. This suggests that the 
risk of learning a wrong answer by memorising an
item is non-existent. Indeed, more detailed research
has come to the identical conclusion, even where 
candidates have been told to try and remember the 
questions during the test period. 

When attempts are made to increase the difficulty 
of supply-type items, the usual procedure is to 
construct a passage, or a connected argument, or a
mathematical or scientific proof or calculation, with 
key words or figures omitted. The candidate needs 
to comprehend the piece as a whole before valid 
entries can be made. The problem here, usually, is to
omit sufficient material to test the candidate’s powers
of reasoning or comprehension, without making the
residue so vague or ambiguous that the intentions 
of the examiner become a greater problem than 
providing the answers. A classic case of this occurred 
in the writer's earlier experience where a compre- 
hension passage was set in Arabic, the mother 
tongue of the students concerned, in a Geography 
test. Omissions were made of technical words, which 
the candidates had to supply. To the consternation 
of the teachers involved in the marking, it was found 
that a virtually different set of answers made com- 
plete sense because, partly due to the omission of 
short vowels in printed Arabic, the passage was
capable of a different interpretation. The intentions 
of the examiner were thus completely evaded, yet it 
was impossible to do otherwise than award marks 
for the second version. 


(c) Multiple-choice items 


The multiple-choice item is the second basic type, 
and is used very widely. The basic form consists of 
an element, called the ‘stem’ and containing, or 
implying, a question to which several responses, or 
‘answer options’, are given. The single correct
response has to be indicated by the candidate by
marking, underlining or recording the number of his
choice. The two examples below are not intended to
suggest more than the way that items are presented.
Many more examples will be given in Part 3. In fact, 
the use of multiple-choice items can be so varied in 
its purpose, so subtle in its approach, that they can 
be far more searching to the candidate even though 
the answer lies before him. 


Two simple examples of the multiple-choice form 
of item are: 


Example 1: The sum of 210 and 169 is
1. 369    2. 41    3. 379    4. 279    5. 269


This, of course, is only suitable at very elementary 
levels, but each of the four wrong answers could be 
obtained by a candidate who makes the common 
error of failing to carry forward. Forty-one is the
difference of the numbers and implies a common
confusion in young children of the meaning of
‘sum’ as an addition, with ‘doing sums’, i.e. calculations
in general.

Here is a more complex example, involving simple 
reasoning. Note that the level of English needed is
very low, merely the concept of ‘fast’, ‘faster than’
and their opposites, and the idea of a ‘start’ in a
race. With school sports held even at junior levels, 
these ideas are well within their comprehension. 
Hence the real test is one of reasoning, and the 
English involved is no handicap, at this level. 
Example 2: John can run faster than Peter or 
Jane. Jane cannot run as fast as Peter. 
Which sentence below is true? 
1. John can run faster than Peter, but not as 
fast as Jane. 
2. Jane and John can both run faster than 
Peter. 
3. Jane is the slowest runner. 
4. John is the fastest runner, and Peter the
slowest.
5. In a race, Peter will always win if he is
allowed to start before John, but Jane could
never win.

The simplest case of a closed-type item is the 
true/false choice. This has some value either in very 
long, very low-level tests, where the young child can 
readily grasp the idea of one answer out of two, or 
at very advanced levels indeed where an extremely 
high score is needed for selection to a further form of 
testing. It has been used, for example, to select 
high-grade civil servants where a long and complex
passage on Economic Theory was followed by 20 
true/false type of items, such as 

‘This passage was written by an Englishman’ 

Yes/No. 

‘The theory involved is certainly pre-1914’

Yes/No. 

and so on. If a candidate scored 18 or over, he was 
then interviewed.

At the more usual levels of either Secondary 
selection or G.C.E. ‘O’ level, however, this simple
form is too wasteful, for guessing alone could lead
to half-marks and thus the effective scoring range is
very limited. The more choices there are, the more 
searching the question can be made, but in practice 
there is a limitation imposed by the need to make all 
answers sufficiently plausible. If a response is not
used, save perhaps by a few very weak candidates, 
who may well be guessing, it is clearly pointless to 
use it. In practice, five responses are usual, with 
perhaps only four in some topics where it would be 
difficult, or artificial, to introduce an extra option. 
This adequately diversifies the item, and sufficiently 
reduces the element of guessing. 
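
The effect of the number of options on guessing can be put in figures. The sketch below assumes that a candidate guesses blindly and independently on every item with all options equally attractive, which is a simplification introduced here and not a claim made in the text; the function names are illustrative.

```python
from math import comb


def chance_score(items, options):
    """Expected number right when every item is answered by blind guessing."""
    return items / options


def prob_at_least(pass_mark, items, options):
    """Binomial probability of reaching pass_mark by blind guessing alone."""
    p = 1 / options
    return sum(comb(items, r) * p**r * (1 - p) ** (items - r)
               for r in range(pass_mark, items + 1))


print(chance_score(20, 2))       # 10.0, i.e. half-marks on 20 true/false items
print(chance_score(50, 5))       # 10.0 out of 50 on five-option items
print(prob_at_least(18, 20, 2))  # roughly 0.0002
```

Under these assumptions a score of 18 or more on 20 true/false items is reached by pure guessing about twice in ten thousand attempts, which is consistent with holding the pass-score well up when few options are available.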




(d) Matching blocks 


There are other forms and patterns of objective test, 
many of which are very useful for school or class-
room use, although in large-scale examinations, where
analysis and marking are handled by machine-scoring
methods, the basic multiple-choice pattern is used 
almost exclusively. 

Matching blocks are one useful type. There can 
be, for example, a list of countries on the left, and a
list of, say, capital cities on the right which have to 
be correctly allocated to their countries. The use of 
‘extra’ cities prevents the last one being completed 
by elimination: 


Example:
    NORWAY        Melbourne     Washington
    AUSTRALIA     Rome          Lagos
    ITALY         Florence      Accra
    U.S.A.        New York      Canberra
    GHANA         Oslo          Stockholm
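
A matching block of this kind is marked against a fixed key, one mark per correct allocation. The sketch below is an illustration only: the key uses the capital cities from the example above, while the candidate's answers and the function name are invented.

```python
# Marking key for the matching block; the five unused cities are the 'extras'.
KEY = {
    "NORWAY": "Oslo",
    "AUSTRALIA": "Canberra",
    "ITALY": "Rome",
    "U.S.A.": "Washington",
    "GHANA": "Accra",
}


def mark_matching(answers, key=KEY):
    """Count one mark for each country allocated to the city in the key.
    A missing or wrong allocation simply earns nothing."""
    return sum(1 for country, city in key.items()
               if answers.get(country) == city)


# Invented candidate who falls for two of the extra cities.
candidate = {
    "NORWAY": "Oslo",
    "AUSTRALIA": "Melbourne",
    "ITALY": "Rome",
    "U.S.A.": "New York",
    "GHANA": "Accra",
}

print(mark_matching(candidate))  # prints 3
```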


(e) Maps, diagrams, graphs etc. 

Diagrams, maps and sketches, providing there is 
only one clear correct answer pattern to them, also 
add variety and interest while remaining objective. 
Examples are given later. 


2. Types of test function 
(a) Speed tests 


Some types of test consist of very large numbers of 
very easy questions, frequently of similar types. The 
object is to test pure speed. The questions are so easy 
that it is not anticipated that more than a very small 
number will be wrongly answered. The score for each 
candidate is, therefore, simply the number correct in
a limited time. The intention is that nobody com- 
pletes the test. This is then a pure ‘speed’ test. Its 
most frequent application is in elementary arith- 
metic, testing speed and accuracy in simple adding, 
subtracting, multiplying and dividing of small 
numbers. The items can get harder as the test goes 
on, but there should be no reason at all, given time, 
why every candidate could not score 100%. This
kind of test is only objective if a total score of correct 
answers is decided by the speed of working and not 
by knowledge of principles and methods. It is some- 
times used to select the ‘quick’ thinker, or to test 
sheer ‘staying power’ over a mentally exhausting
exercise. It has rather limited use in examinations,
but is ideal to ‘ginger up’ a class after ample practice 
at a more leisurely rate. 


(b) Untimed tests 
These are the exact opposite. The questions are 
usually fairly difficult and often become more 
difficult as the test proceeds. They are used often in 
individual testing by psychologists. Their function 
is to allow the candidate unlimited time (within 
reason) to answer until he cannot attempt any more 
questions. If they are of objective pattern, there may 
at the end be extensive guessing shown up by an 
erratic pattern of almost random correct responses, 
and if an omitted answer, or blank, is allowed (a poor
idea) there may eventually be a long string of 
blanks. Here again, total knowledge is measured by 
having questions of increasing difficulty, and yet a 
large number of them, in gently increasing difficulty 
levels. For large-scale use they are rather impractical 
since in administration some time limit is inevitable. 
For individual cases, time may well be available, 
since the psychologist possibly needs to diagnose 
very carefully the mental and physical attributes of 
a possibly backward or emotionally disturbed child. 
The type and scope of such tests is almost infinite in 
variety but they fall outside the present terms of our 
discussion. 


(c) Power tests 


This is the type of test with which we shall be mainly 
concerned, in its multiple-choice form. A power test 
is one which is designed to measure the attainment 
or the level of knowledge achieved at some defined 
point. Their main academic use is in measuring 
attainment in school subjects, but it is possible to 
construct such tests in various practical skills also, 
and these have indeed been used in the Services, both
in the United Kingdom and, more extensively, in the
U.S.A., first to measure the skills of entrants so as
to determine their training needs, and later
to assess their success with the training.


(d) Prediction tests 

These are aimed at providing basically an order of 
merit of the candidates which corresponds with their 
order on one or more criteria. For each criterion 
involved, an extra test is required and a separate
order of merit must be obtained. The accuracy of 
the orders of merit obtained from the tests depends
on the validity of these tests in measuring the criteria
involved. These criteria need accurate definition 
initially and should preferably be uncorrelated. 
Then the separate tests can be made ‘pure’ so that 
each test measures one criterion only. If, however,
the criteria are correlated, which may often be the 
case in academic subjects, the tests may involve 
more than one criterion and the resulting analysis is 
consequently complicated. The purity of tests is 
achieved by item selection based on the statistical 
evaluation of each item and, necessarily, perfection 
is impossible even with the use of a computer to 
assist in the analysis. With both careful research and
evaluation, however, some very fine tests have been
evolved during the last few years. A single test 
aimed at estimating an order of merit on a single 
criterion is a simpler task and the major requirements
are a plentiful stock of well-written items, a carefully
chosen sample of candidates for a trial of the
material, and a selection of the final items after 
accurate item analysis. 
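
One common way of checking how closely the order of merit on a test corresponds with the order on a criterion is Spearman's rank correlation coefficient, 1 - 6Σd²/n(n² - 1). The text does not name a particular statistic at this point, so its use here is an assumption, and the scores below are invented. The sketch also assumes that there are no tied scores.

```python
def ranks(scores):
    """Rank 1 for the highest score; tied scores are not handled."""
    order = sorted(scores, reverse=True)
    if len(set(order)) != len(order):
        raise ValueError("tied scores need an averaged-rank treatment")
    return [order.index(s) + 1 for s in scores]


def spearman_rho(test_scores, criterion_scores):
    """Spearman's rank correlation, 1 - 6*sum(d^2) / (n*(n^2 - 1))."""
    n = len(test_scores)
    d_squared = sum((a - b) ** 2
                    for a, b in zip(ranks(test_scores), ranks(criterion_scores)))
    return 1 - 6 * d_squared / (n * (n * n - 1))


# Invented scores for five candidates on a test and on the criterion.
test = [42, 37, 51, 29, 45]
criterion = [68, 60, 75, 50, 65]

print(spearman_rho(test, criterion))
```

A value near 1 indicates that the two orders of merit agree closely; here only two candidates exchange adjacent places, so the coefficient is high.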


(e) Selection tests 


The objective here is usually simpler. The result of 
the test is to divide the candidates into two groups; 
the pass group and the fail group. Here the objective 
of selecting items for a final test is to ensure accuracy 
at the line of demarcation, whereas the exact order
of merit at the extremes of the scale is not so
essential. Naturally the problem is complicated if
more than one dividing line is needed—for example, 
grades of excellent, good, credit, pass and fail may 
be needed for some selection tests. In such a case the 
order of merit needs to be as accurate as possible 
over the greater part of the range of ability of the 
candidates, and this imposes stricter limits on the 
statistical values obtained for each item used. 


3. Constructing a test schedule 
(a) Study of the test syllabus 


The first task in constructing a test is to decide what 
is to be tested. Unless the test is to cover only a very 
limited field, as for example, a short class-room test 
on two weeks’ work, it is essential to write out a test
schedule. The first step in this is to make what is, in
effect, an amplification of the examination syllabus
to be covered, whether this is an external public
examination, or an internal school examination.
Thus it is a list of main topics, each divided into
sub-topics. It is then necessary to ask what the aim
is in teaching each topic as it is covered. At very
elementary levels, the major point will be in teaching 
basic facts, but as higher stages are reached, there 
will need to be some appreciation of general 
principles, of the relationship between facts, of cause 
and effect, of deductions and inferences from the 
principles laid down, of some evaluation of attitudes 
or ideas, and possibly, at advanced levels, some 
development of hypotheses from the material 
studied. All of these developments over and above 
the basic learning of facts are capable of being tested 
by multiple-choice tests. Of course, such items are 
not so easy to construct as those demanding only 
factual knowledge, and both patience and practice 
are needed. Nevertheless it is essential to consider in 
turn each sub-topic of a syllabus, and decide which 
of the above fields of learning is required for this 
topic at the level being tested. A two-way table is 
thus built up, with the sub-topics down the left-hand 
column, and the various aspects mentioned above 
across the top. Initially a tick is placed in each 
column wherever that aspect of learning is needed
for each sub-topic in turn. The aspects above are 
written in their order of difficulty, so that it would 
be usual to find for each sub-topic a tick in the 
‘basic fact’ column, followed by further ticks in
adjacent columns, each row of ticks ending under 
the most difficult aspect appropriate to the topic. 
The more difficult elements of the syllabus, particularly
at the elementary stage, may never go
beyond the purely factual level.

Once this stage has been reached, two further
points need to be considered. Firstly, how many
ticks are there? There may already be as many as 
40 or more. This immediately implies that even if 
there is only a single item associated with each tick, 
there will already be up to 40 items needed. How
long is the test to be? Depending on the difficulty of
the items and the subject involved, it is not
unreasonable to expect candidates to answer from
40 to 60 multiple-choice items in one hour.


(b) The weighting of the test syllabus
The second further task is to weight each tick by its 
relative importance. Usually this means that whereas 
the most difficult aspects of each topic remain at a
level of one item only, as the aspects move back
through the table towards the left, towards the
‘basic fact’ level, the weight tends to increase,
although there may still eventually be only one item 
per topic on the basic fact level itself. 

This process increases the number of items, each 
tick being replaced by a number from 1 upwards. 
The total number of all items can now be counted. 
This is the final test specification. When a teacher 
evolves such a specification for the first time, he 
frequently finds that he has gained a much greater 
insight into the content of the teaching programme. 
Certainly the specification itself often leads to a re- 
assessment of both the aims in the teaching of a 
subject and the relationship between its elements. 
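
The two-way table and its weighting can be held as a simple grid of numbers, from which the row totals, column totals and the full item count follow directly. The sub-topic names and weights below are hypothetical and are not taken from any actual schedule; a zero stands for a cell with no tick.

```python
# Aspects of learning, in increasing order of difficulty (illustrative subset).
ASPECTS = ["basic facts", "deductions", "evaluation"]

# Hypothetical weights: the number of items planned for each cell.
specification = {
    "sub-topic 1": [1, 2, 1],
    "sub-topic 2": [2, 1, 0],
    "sub-topic 3": [1, 1, 1],
}


def row_totals(spec):
    """Number of items planned for each sub-topic."""
    return {topic: sum(weights) for topic, weights in spec.items()}


def column_totals(spec):
    """Number of items planned under each aspect of learning."""
    return [sum(column) for column in zip(*spec.values())]


def total_items(spec):
    """Length of the whole test implied by the specification."""
    return sum(sum(weights) for weights in spec.values())


print(row_totals(specification))
print(dict(zip(ASPECTS, column_totals(specification))))
print(total_items(specification))  # prints 10
```

Comparing the total with the 40 to 60 items that candidates can reasonably answer in an hour shows at once whether the weights need trimming or the test needs dividing.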

In constructing the schedule, an examining body 
has the advantage of drawing on the experience of 
several senior and well-qualified teachers who 
together devote several days’ hard work to the task.
Similarly, if a teacher wishes to evolve such a 
schedule for school use, the assistance of his 
colleagues in the subject, and of other friends from 
neighbouring schools, not only makes the task 
easier, but the diverse opinions encountered make 
for eventual accuracy and clarity. Eventually, of
course, the existence of a common test schedule in 
several schools in any area can help to standardise 
the coverage of the subject to the advantage of the 
weaker schools. It is by no means merely a dry 
academic exercise. 


(c) Example of a test schedule 


To illustrate the general principles involved, a very 
small portion of a test schedule developed by the 
writer some years ago for use in an experimental 
primary school is given. 


Subject:    Arithmetic
Objective:  Special programme for high-intelligence groups
Level:      Primary school
Topic:      Decimal Notation


MAIN HEADS                  SUB-HEADS

Place Position              Relation to fractions 1/10, 1/100 etc.
                            Effect of multiplication by 10, 100.
                            Effect of division by 10, 100.
                            Squares and effect on position.

Significant figures         As applied to:
                            Whole numbers, numbers and
                            decimal places, decimal places
                            alone.

Addition and Subtraction    Preserving place position.
                            Rounding to significant figures.

Applications                Linear measure:
                            English, metric.
                            Area measure:
                            English, metric.
                            Relationship to percentages of
                            numbers, money, quantities.


To illustrate the second stage of the schedule, the 
first topic is further broken down into the two-way 
table discussed above. Three aspects only of learning 
were involved: 


(A) the basic facts and practice therein 
(B) deductions from these facts 
(C) evaluation of related ideas. 


The two-way table starts: 


Topic: Place Position

                                Learning Aspects
Sub-Heads                       (A)   (B)   (C)   Total
Relation to fractions
Multiplication by 10, 100
Division by 10, 100
Squares


There were thus thirteen items written (not all 
multiple-choice) covering this one topic. In the full 
analysis, there were a total of 75 items covering 
decimal notation. Similar schedule elements led to a 
total of 450 items covering the whole two-year 
programme. These items were kept secret so that 
they could be used on several occasions. Eventually, 
the test was divided into three sections of 150 items, 
one being administered in short sections immediately 
following class tuition by the teachers involved in 
the scheme, the second being administered in longer 
sections towards the end of the first year, and the
final section being similarly reserved to the end of the
second year. The more difficult aspects were reserved
for the final tests and, as a result both of the
teachers’ experiences with the programme and of an
analysis of the items used, the specification needed
some rewriting, and some 120-odd items were rewritten
or completely replaced. The intention
in showing this detail is not to frighten teachers away
from the task, but to show that the
schedule does need some experience of teaching, 
coupled with a real interest in analysing the content 
and aim of teaching a subject, with the reward of a 
far better appreciation of the implications involved 
in satisfying an examination syllabus. In any case, 
without a schedule, even if at first the detail is not 
fully developed, the eventual writing of items often 
lacks both precision and purpose.

Once trained or practised in item writing, it is 
possible to write items first and allocate them to a 
specific point in a schedule afterwards. This is not 
always easy however, since many items so written 
tend to be over-complex and contain aspects of two 
distinct topics. This itself is not necessarily wrong in 
more difficult items, but it is easier and more likely to
be effective if each item is written specifically to test 
one point in the schedule. 


4. The training of item-writers 


(a) This section illustrates the way an examining 
body will arrange training courses for teachers in 
item writing. Individuals can of course only teach 
themselves by practice and trying out their work in 
class tests. However, the description below may help
teacher training colleges, for example, to include
some training in their programme and will show 
individual readers what is required in learning 
the techniques of item writing. 

It is probably true that most competent teachers 
can be trained to write effective items in their own 
fields, whereas indifferent teachers are far less likely 
to succeed. There are, of course, exceptions, as 
there are in the field of essay marking, where an 
exceptionally able teacher has to be taught to avoid 
harsh decisions on average scripts. 


(b) Grasp of subject needed 


The first essential for a trainee writer is a sound 
grasp of the subject, with several years’ teaching at
the level of the examination for which the test items 
are required. This grasp needs to go beyond the mere 
mechanics of teaching and must include an apprecia- 
tion of the purpose of teaching the subject, of its 
educational value in society, and of its relationship
to other disciplines. In fact, the teacher needs to
remember and apply the studies which make up a 
major part of any good training college programme. 
This full appreciation is essential in order to under- 
stand the details of a full test schedule, particularly 
if the teacher wishes to offer suggestions for its 
amendment. 


(c) Study of standardised tests 

The initial stage of training is to be shown a wide 
variety of objective tests in the subject, with others 
from different fields. A commentary is given on their 
differences and various efficiencies. As these have 
already been used, statistical details of their per- 
formance will be available for later study. One or 
perhaps two tests are selected for detailed study, 
item by item. Initially, trainees are asked to do the 
test by marking the correct answers. If they make 
any mistakes, it would imply a defect in the item, 
with a wrong answer proving too strong a distractor. 
The various alternatives are discussed with the 
reason for their presence; by no means are they just 
any wrong answers written down at random. Since 
the teachers are being trained mainly so that they 
can contribute items for future use in large-scale 
examinations, there is an emphasis on multiple- 
choice items, but other types of item are also 
exhibited and discussed. This stage of the training 
must be confined to perhaps two full days in a 
seven-day training period, which is usually all that 
can be allowed both by the teacher’s duties and the 
financial allowances that can be made for the course. 
A three-month training course on the other hand, 
would allow up to four or five full days for the study 
and criticism of existing tests. 


(d) Initial writing trials 

In a short course, the next stage, lasting up to three
days, is for the trainee to write several items. 
Initially he may do this without reference to any
specific topic; later the scope of the item can be
limited. This enables the instructor to see first of all 
that the general principles have been mastered, and 
later that the trainee can see how to test a finer detail 
in an item. Naturally, in three days no trainee how- 
ever good can become an expert, but by submitting 
attempts to expert criticism, with selected items used 
for demonstration to a wider section of the course, 


the more obvious pitfalls can be avoided and a 
foundation laid for further practice. In fact, most 
item-writers improve steadily over several years, and 
even to achieve between 50% and 60% effectiveness
in items is highly commendable. Nobody, no matter
how expert, can achieve 100% effectiveness.

A longer course naturally allows a writer to 
continue to practise item writing over several weeks
while undertaking more detailed technical study in 
the main course. The usual target at the end of a 
three-month course is the construction of a test of 
some 50 items, covering a prescribed section of an 
examination syllabus. 


(e) Qualifications needed for an instructor 

The instructor for a short course must himself be an 
expert writer of several years’ standing. It is im- 
possible for him to be a specialist in all subject fields, 
so either the course is confined to that subject in 
which he is a specialist or he is assisted by other 
writers in different disciplines. He will also be 
assisted, if need be, by a ‘test technician’. This rather 
ungainly term, borrowed from American test jargon, 
describes an individual who has a good knowledge 
of assembling a test to meet a specification, although 
he is not necessarily an expert writer of items. He will 
usually have a good knowledge of the meaning of the 
statistical data available on each item, but may not 
necessarily be trained to work out such data. His 
duties are wide rather than specialised. He may need 
to be a chairman of a board reviewing items, or the 
editor of the general instructions to be given to both 
invigilators and candidates, and may have to proof- 
read the printer's first efforts. Another assistant to 
the instructor may be a statistician, if neither the 
instructor nor the test technician is sufficiently expert 
in this field. 

The last two days of a seven-day course are 
devoted to the elementary stages of such processes 
as field trials, scoring methods, item analysis 
principles, and the criteria for selecting a successful 
item. It is not essential for every writer to be an 
expert in these aspects, but it is important that he 
understands the principles involved in the technical 
handling of the basic items after the first writing. 
This enables the trained writer to anticipate likely 
objections to an item and gives him some insight 
into the selection of the final items. In any case, most 
teachers are frightened away easily by the mere 




mention of statistics, and it is quite unreasonable in 
a short course to do more than outline the principles 
involved and avoid both figures and formulae. 

In a longer course, naturally, these processes of 
field trials and subsequent analysis are dealt with 
fully, since the trained writer emerging from such a 
course will normally expect to devote quite a large 
period of the academic year to test construction, and 
may be expected to conduct short training courses 
and to act as a focus for all interested teachers in 
his area. 


(f) Regular production of items 


Once a trained writer begins to produce items by 
commission for an examining body these items are 
regularly built into trial tests and are subsequently 
analysed. Thus the percentage success rate of 
acceptable items can be recorded and both the 
improvement in the writer and his ‘output’ record 
are available to mark his progress. The more 
successful the writer, the more likely he is to be 
asked to continue writing. A word of warning is 
needed, however. To write too many items too often 
is unwise. It is an exhausting task and after a while 
not only is there a deterioration in quality, but there 
is also a tendency to repeat oneself. An example 
occurred with the present writer some years ago, 
where one item writer who had been setting test 
papers for several years developed such an un- 
mistakeable style that the authorship of the paper 
was obvious. He naturally considered himself an 
expert, yet when it was necessary to produce a 
paper in a hurry, two ladies, neither of whom had 
earlier training, produced in one day a paper 
modelled on a mixture of several earlier papers. 
Ultimate analysis proved that this paper gave a far 
better result than several of the recent papers from 
the ‘expert’, who had steadily deteriorated through 
over-production. The moral is, to write a few items 
at a time, leave them untouched for a few days, and 
then look at them again. This can not only reveal 
flaws unnoticed earlier, but can lead to general 
improvements in style, clarity and brevity. The 
second look often generates a second set of items, 
for similar perusal later. Where more than one 
teacher in the area is engaged on this task, reviewing 
each other’s work is a salutary exercise, for flaws
can often be seen in another’s items which have been 


overlooked by the author of the items. 




5. Item writing 


Something has already been said above on the 
writing of items during training. The actual tech- 
nique of systematic writing however, can be varied 
in wide detail. The end-product is presumed to be a 
multiple-choice item with four or five answer 
options. Whatever method of writing is used, it is 
better to concentrate at one sitting on producing 
several items on a specific topic from the test 
schedule. Not all may survive a critical reappraisal, 
but probably one or two will. In this way, the total 
number of items needed can be built up, and the 
agreed syllabus progressively covered. 


(a) Class-room test method of compiling 


A teacher has a rather useful method available 
briefly mentioned earlier. Compile a class-room test 
of, say, fifty short questions of the open-ended type
to which the answer is a single word, or very short 
phrase, or a number. These can be progressively 
compiled during the year to test thoroughly each 
teaching topic as it is completed. Administer the test 
and mark it, for these may well serve as the normal 
class tests and provide suitable marks for the class 
records. Notice and record during the marking not
only how many pupils answer each question 
correctly, but record all wrong answers given and the 
number of times each recurs. If before recording this 
data the test sheets are arranged in mark order, the 
top mark first, down to the lowest mark at the end, 
it is possible to record all the data needed for their 
eventual conversion to multiple-choice form at a 
single session. 

The method of analysis for a class-room test is 
dealt with in Part 4, but it suffices to say here, that 
when this recording of errors has been completed, 
each item can be ‘closed’ to form a multiple-choice 
item by adding to the correct answer the three, or 
perhaps four, commonest wrong answers. The class 
themselves thus provide the ‘distractors’ so that such 
an item is more likely to be successful than one where 


the teacher invents possible errors to provide such 
distractors. 
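
The 'closing' step described above can be sketched in a few lines of Python. This is only an illustration: the function name `close_item` and the sample responses are invented for the example, and a teacher working by hand would do the same tally on paper.

```python
from collections import Counter

def close_item(correct, responses, n_distractors=3):
    """Turn an open-ended question into a multiple-choice item.

    correct:   the right answer to the question
    responses: every answer written by the class (right and wrong)
    Returns the correct answer followed by the commonest wrong answers.
    """
    # Tally only the wrong answers and how often each recurs.
    wrong = Counter(r for r in responses if r != correct)
    distractors = [answer for answer, _ in wrong.most_common(n_distractors)]
    return [correct] + distractors

# Hypothetical class responses to a single arithmetic question.
responses = ["1554"] * 12 + ["1544"] * 6 + ["1454"] * 4 + ["222"] * 3 + ["79"]
options = close_item("1554", responses)
print(options)
```

The options still have to be put into a random order before the item is used, so that the correct answer does not always come first.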


(b) Compilation of regular mistakes by pupils 

A second technique, also easy for a teacher, starts by 
listing common errors and misconceptions noticed 
in the work of the class, whether in verbal replies or


in written work. When the time comes to write some 
test items, the basic question is first selected and its 
correct answer recorded. A study of the accumulated 
list of common errors in class then often helps to 
supply alternative ‘distractor’ responses. If in- 
sufficient options come by this method, it will be 
necessary to invent others. If this is difficult, it may 
be possible to reword the item to introduce a more 
complex chain of thought and thus provide further 
likely errors. Alternatively, if these are to be tried 
out in class, an extra line can be left and the pupils 
told to write in the correct answer if it has not 
already been given. To be successful, a few items 
must deliberately be given without the correct 
answer, or the pupils will soon realise there is never 
any need to write on the blank line. The occasional 
use of the option, ‘None of the above is correct’, is 
also a useful distractor. Of course now and then this 
option must be the correct answer for its use to be 
effective. 


(c) Matching blocks as a method, leading to single 
items 


A third technique which can lead to useful items is 
the use of a matching block. An example has been 
given in Section 1 (d) above. In this form, five or 
six items are written on a similar theme, so that the 
answers to each one are of similar form. Alongside 
the block of items, there is a list of possible answers, 
there always being more answers than there 
are items. The candidate has to match the 
correct answer to each item from the block of 
answers provided. To succeed, the answers must all 
be plausible ones for each item, hence the need to 
preserve a unified theme for the items. The disparity 
of answers to items is to prevent the last item being 
answered by elimination. Of course it may be that 
the same answer is correct for more than one item, 
but it is usual in that case to warn the candidates 
that this may occur in the instructions preceding the 
block of items. If any item is almost perfectly 
matched, it should be eliminated. By studying the 
pattern of response to the remaining items, it may be 
possible to eliminate some of the answers too, where 
they are not selected save by very few candidates. 
Either the items can be left as a block, or broken 
down into separate items, each with its set of four 
or five responses. 


(d) Aural tests of foreign languages 


One example can perhaps be given of the way 
pronunciation of a foreign language can be tested 
aurally by multiple-choice items. A list of five very 
common words, in French say, is compiled and can 
be written on a blackboard or printed at the head of 
an answer sheet for each candidate. A second list of 
words is then read out, one by one, preceded by a 
question number in each case. One of the vowel 
sounds in the spoken word is the same as the vowel 
sound in one of the written words. The candidate 
has to record the letter corresponding to the written 
word with the same sound. 


EXAMPLE 


Written list 

A. belle B. deux C. dans D. sous E. eau 
Spoken list 

peur, or fleur... Write B. (As in deux etc.) 


Since five vowels do not exhaust the subject, a second 
list, such as 


A. froid B. donner C. manque D. réponse 


can be compiled. 

Once given a simple idea such as this, the ingenuity 
of teachers can often find expression in developing 
similar techniques for class-room tests. A number of 
examples are given in Part 3. 


6. The review of written items
(a) Function of a reviewing team 


When a large number of items has been collected 
from several writers for use in professional examina- 
tions, it is essential that each item be reviewed by an 
independent person, or group of persons. Their task 
is to submit each item in turn to a test of key ques- 
tions. 


(i) Is the item itself factually correct and is there a 
uniquely correct answer? It has happened that a 
writer drafts an item and its set of answers. Later 
he improves the wording of the item, but fails to 
notice this vitiates the earlier answer, which also 
needs re-wording. Occasionally, the reviewing 
team disagree with the selected answer, or consider
one of the other answers equally valid. It is
really a question of the original writer becoming 




so bound up with his work that he overlooks 
minor flaws and errors which are at once seen by 
another subject specialist coming fresh to the 
item. 


(ii) Are the alternative answers provided sufficiently 
plausible, or can they be improved? 


(iii) Do the reviewers agree with the item writer as to 
the point at which the item should come in the 
test schedule? 


(iv) Is the language of the item so phrased that it is 
fully comprehensible to all likely candidates? 
The use of unusual words, for example, should 
be eliminated. The wording should be as short, 
simple and direct as possible. 


(v) Is the item grammatically correct? This sounds 
unusual, but it sometimes happens in the re- 
writing of a question that the item no longer 
reads correctly with one or more of the answers. 
A simple illustration of this point is where the 
item concludes with the word ‘...an’, implying
a following noun with an initial vowel. If one 
or more of the answers do not so commence, 
the quick-witted candidate can at once eliminate 
such responses. Such basic faults can, of course, 
readily be detected and eliminated, but others 
may not be so readily detected. 


It is the final responsibility of the reviewing team, 
as subject specialists, to guarantee that each item is 
accurate in every detail. 


(b) Rules for writers 

It is useful to lay down precise rules for writers when 
submitting their material for review. This helps to 
eliminate the wastage of items and disciplines the 
writer in using a standard pattern for submissions.
Each examining body makes its own rules, but one 
set of rules could be listed as follows: 


1. Retain no copy; your original will be filed for 
reference. Destroy by burning all earlier drafts. 


2. Use ruled paper, 8 in. wide by 5 in. deep, writing 
on one side only. Leave a 1 in. margin on the left 
for eventual binding into a folder. 


3. Number all items in order, using a new sheet for 
each item. 


4. Write the item. 




5. Give immediately below it the correct answer. 
6. Then give the four (or three) alternative answers. 


7. Do not attempt to arrange the answers in
sequence as this is done later by random statistical 
methods. 

An example of an item as it is submitted: 

QUESTION 37 


Underline one of the words below which means 
the same as the word in capitals. (A) 


DAMP (B) 
moist (C) 
block, wet, dry, swear (D) 


(A: the rubric or instructions, which may be common 
to a group of questions. B: the key word. C: the 
correct answer. D: the distractors.) 


The final form of the item may be something like 
this, the rubric being amended to ‘one of the words
on the right’ to fit the changed format:


            1      2    3      4      5
37. DAMP    block  wet  moist  swear  dry


It is similarly useful to provide both item-writers 
and the reviewing team with a set of basic questions 
to which each item should be submitted. Some of 
these questions are automatically covered by the use 
of a good test schedule and the allocation of all items 
to specific points of it, but the list does prevent 
loosely written items from escaping without 
adequate rewriting. 


(c) List of criteria for judging suitability of items 
1. Is the item as a whole realistic and practical ? 


2. Does it deal with an important and useful part of
the subject? 


3. Is it phrased in the working language of the 


subject, at a level of comprehension appropriate 
to the candidates? 


4. Does the item require knowledge of this subject, 
and not other subjects, to answer it? 


5. Is it independent of other items in the test? (i.e.
does the answer to one item give away the 
answer to a later item?). 


6. Is it specific, accurately stated, and the essential 
problem involved quite clear? 


7. Is it as brief as possible, yet complete and 
unambiguous? 


8. Does it contain only material relevant to its 
solution? (Unnecessary data only lengthens and 
confuses an item.) 


9. Are all the distractors plausible and expressed in 
similar grammatical form? 


10. Does it give away the answer by irrelevant detail 
or extraneous clues? 


In practice, there are simple rules in some cases 
which automatically cover many of the above points. 
For example, in a simple arithmetic test the item 
might ask for: 


37 × 42 and offer 1554 1556 1644 1444 1564


Even an average pupil will realise that 7 × 2 ends in 4,
and at once discards 1556 as a result: hence replace 
1556 by, say, 1464. Thus further work is needed. An 
astute pupil will realise that 42 divides by 3 and thus 
at once eliminates 1444 and 1564 which do not 
divide by 3. The choice is thus reduced to three, yet 
the level of reasoning involved in the concept of 
division by 3 is as high as, or superior to, that needed 
to multiply two digits together, so in this case there 
is no need to eliminate options on this score. 
Rewriting often overcomes a defect, but can also 
make an item pointless. Consider the synonym 
example above. If it is rewritten as: 


         1      2     3    4    5
MOIST    block  damp  wet  dry  swear


then although MOIST-DAMP is still the correct pairing,
the distractor ‘swear’ becomes pointless and ‘block’
is hardly likely to be useful. Yet in the earlier form
it was a successful item, since the above two
distractors served a purpose by confusion with ‘damn’
and ‘dam’.
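
The two short cuts noted in the arithmetic example (the last digit, and division by 3) can be checked mechanically before an item goes for review. The sketch below is an illustration only; the function name `survives_shortcuts` is invented here, and the decision whether a short cut matters remains a matter of judgement, as the text explains.

```python
def survives_shortcuts(a, b, option):
    """True if `option` cannot be rejected for a x b by either short cut."""
    # Short cut 1: the last digit of the product is fixed by the last digits.
    last_digit_ok = option % 10 == (a * b) % 10
    # Short cut 2: if either factor divides by 3, so must the product.
    if a % 3 == 0 or b % 3 == 0:
        divisible_ok = option % 3 == 0
    else:
        divisible_ok = True
    return last_digit_ok and divisible_ok

original = [1554, 1556, 1644, 1444, 1564]
amended = [1554, 1464, 1644, 1444, 1564]   # 1556 replaced by 1464

# Options in the original list that the last-digit test alone removes.
last_digit_rejects = [o for o in original if o % 10 != (37 * 42) % 10]
# Options in the amended list that survive both short cuts.
survivors = [o for o in amended if survives_shortcuts(37, 42, o)]
print(last_digit_rejects, survivors)
```

With the amended list, three options survive both tests, which agrees with the count given in the text.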


7. Trial testing 


(a) Reason for trial tests 

Every item needs to be tested on students who are at 
the same level of study and achievement as those for 
whom the test is designed. A subsequent statistical 
analysis of the trial results allows weak items to be 
eliminated, or possibly improved by rewriting. The 
eventual performance of the item in final examina- 
tion use can then be predicted. The accuracy of the 




prediction will clearly depend on the use of proper 
sampling technique in trial testing. 


(b) Scope of trial tests and methods of compiling trial 
tests 

Every item written, once it has been reviewed and 
amended if necessary, should be included as part of 
a trial test. This test, like a final paper, should cover 
the full syllabus and be carefully balanced against the 
test schedule. Usually several tests in each subject 
are compiled at the same time, mixing together as 
far as possible the work of several writers. These 
tests are then administered to a selected group of 
pupils who are being prepared for a final examination
on the same syllabus. 

The tests should be so timed that the shortest
possible time elapses between the tests and the final 
examination. This ensures that the pupils will have 
completed their studies and are most likely to be at 
the stage of revision for the examination. For one 
thing this makes it most likely that they will have 
covered all of the topics included in the test; 
secondly, because the test acts as a revision exercise 
itself, their interest and therefore their motivation 
will be high; and finally, the correlation eventually 
obtained between their test scores and final examina- 
tion marks will be reliable. When several tests in 
each subject are available, these should be mixed 
together and distributed in roughly equal numbers 
among a group of pupils. This eliminates the risk 
of a test which may, for example, be rather more
difficult than the rest, being administered to a par- 
ticularly poor, or good school, when it would then 
be impossible to say whether the test or the school 
were non-average. By mixing the tests, not only does 
a full range of scores result, from all qualities of 
school, but from the test-average scores some 
grading of the schools is possible. For large-scale 
examinations, for which the above procedure is 
designed, it is important to select a range of schools 
from the best to the poorest, based on their earlier 
examination results. Clearly, if only the best schools 
were used for trial tests, a false picture would be 
obtained of the difficulty of the items, since eventually
they would find their way into a paper taken by all 
schools. 
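
One simple way of mixing several tests so that each goes out in roughly equal numbers is to deal them in rotation, carrying the rotation on from one school to the next. The sketch below assumes this rotation scheme, which the text does not prescribe; the school names and class sizes are invented for the example.

```python
from collections import Counter
from itertools import cycle

def distribute(tests, pupils_by_school):
    """Deal the tests out in rotation across all pupils, school by school."""
    rotation = cycle(tests)
    return {(school, pupil): next(rotation)
            for school, pupils in pupils_by_school.items()
            for pupil in pupils}

# Hypothetical schools with 7 and 8 pupils, and three trial tests.
schools = {"School 1": range(1, 8), "School 2": range(1, 9)}
allocation = distribute(["A", "B", "C"], schools)

overall = Counter(allocation.values())
school_1 = Counter(t for (s, _), t in allocation.items() if s == "School 1")
print(overall, school_1)
```

Every test then reaches every school, so that an unusually hard test cannot be confused with an unusually weak school.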


(c) Use of more than one test 


Sometimes two or more different tests in the same 
subject are administered to a group of potential 




candidates. This may be because of the administra- 
tive difficulty of locating sufficient candidates to 
stabilise the resulting statistics, or because it is useful, 
for new subjects, to obtain a cross-check on the 
consistency of performance on the two tests. If this 
occurs, it is again better to arrange that all possible 
pairs of tests are used equally. Suppose there are four 
tests in Physics, say, lettered A, B, C, D. Then in a
group of 30 candidates there would be six arbitrary 
groups of five, the six groups using the pairs: AB, 
AC, AD, BC, BD, CD. In this way 15 candidates 
take each test, yet every pair is equally used. Exact 
precision is not essential and with odd numbers not 
always possible, but it is better to take this statistical 
precaution to guard against another possibility— 
impossible to predict in advance. Just suppose that 
test A happened to include a few items similar to 
those in test B, then if in one school all candidates 
received first test A, then test B, while in another 
school the pair used was B-C, then the scores on 
test B would be inflated at the first school, but not
at the second. Admittedly, this is a slender proba- 
bility, but the above pairing procedure eliminates 
the chance of the scores being biased if it happens to 
occur. 
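
The pairing of four tests among 30 candidates can be laid out as follows. This is a sketch of the arithmetic only: the candidates are simply numbered 1 to 30 and taken in consecutive fives, which is one arbitrary grouping among many.

```python
from collections import Counter
from itertools import combinations

tests = ["A", "B", "C", "D"]
pairs = list(combinations(tests, 2))          # AB, AC, AD, BC, BD, CD
candidates = list(range(1, 31))               # 30 candidates, numbered 1-30
group_size = len(candidates) // len(pairs)    # six groups of five

# Give each consecutive group of five candidates one pair of tests.
allocation = {cand: pairs[i // group_size] for i, cand in enumerate(candidates)}

per_pair = Counter(allocation.values())
per_test = Counter(test for pair in allocation.values() for test in pair)
print(per_pair, per_test)
```

Each test appears in three of the six pairs, which is why 15 candidates take each test.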

A final point is in the use of keying items. This 
again is a technique to improve the information 
from test-trial scores. In each test five or so items, on 
different topics, are selected and these are included 
in different positions in another test. The eventual
item analysis of these key items, from their positions
in different tests, provides a link between the tests.
If, irrespective of where they occur, the analyses are
closely similar, this shows that the trials have been 
well-conducted and increases confidence in the 
stability of the statistical values obtained for all 
items. 

This process of using keying items is best illus- 
trated by an example. Suppose there are three tests 
A, B, C, each of 50 items in English, then five items 
are selected from each of the three tests and added to 
the other two tests, increasing the number of items 
in each test to 60, rather like this: 


TEST A 
Select items 1, 7, 20, 31, 40 


TEST B 
Select items 2, 3, 9, 24, 30 


TEST C 
Select items 4, 7, 10, 19, 24 




(The item numbers, of course, are used as 
an example.) 


The ten items selected from Test B and Test C are 
then added to the 50 items of Test A, in any suitable 
position. The test is then renumbered from 1-60. 
Similarly, the ten items selected from Test A and 
Test C are added to the 50 items of Test B, which is 
then renumbered. Finally, the ten items selected 
from Test A and Test B are added to Test C which is 
also subsequently renumbered. In this way every 
group of five selected items occurs in all three tests, 
so that 15 items act as keys to stabilise the results
out of each 60 items. Thus, 25% of the test can be
used as a key. If just the same five items are repeated
in all three tests, the efficiency of the resulting
analysis is much lower.
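
The assembly of the three 60-item tests can be sketched as below, using the item numbers of the example. The borrowed items are simply appended here, although the text allows them to go in any suitable position; the function name `build_trial_test` is invented for the illustration.

```python
# Key items chosen from each 50-item test (numbers taken from the example).
key_items = {
    "A": [1, 7, 20, 31, 40],
    "B": [2, 3, 9, 24, 30],
    "C": [4, 7, 10, 19, 24],
}
key_set = {(test, n) for test, nums in key_items.items() for n in nums}

def build_trial_test(name, n_own=50):
    """Return {new number: (source test, original number)} for one trial test."""
    own = [(name, n) for n in range(1, n_own + 1)]
    borrowed = [(other, n) for other, nums in key_items.items()
                if other != name for n in nums]
    # Renumber the enlarged test from 1 to 60.
    return dict(enumerate(own + borrowed, start=1))

trial_tests = {name: build_trial_test(name) for name in key_items}
key_share = {name: sum(item in key_set for item in test.values()) / len(test)
             for name, test in trial_tests.items()}
print(key_share)
```

Each enlarged test carries its own five key items together with the ten borrowed ones, which gives the 15 in 60 quoted above.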

The class-room test naturally has no need of this 
rather involved procedure, which is used auto- 
matically in any well-conducted trial test programme 
for papers to be eventually used on a national scale. 

The method of analysing the results of a trial 
test programme is given in detail in Part 6, and 
there is no need to discuss it again here. 


8. The synthesis of a final paper 


(a) Criteria for item selection 


After the trial tests have been completed, and the 
results analysed, a verdict is recorded on each item 
used. By this time the general accuracy of the items 
is not in doubt and the verdict is basically statistical 
in approach. First of all, the percentage of can- 
didates who answered the item correctly is examined. 
Limits are usually fixed from 25% to 75% as an
acceptable ‘facility’ level, but there may be reasons
for amending these limits in particular cases,
particularly for an examination which is to be highly
selective. The general theory of this determination of
limits in special cases is outlined later. Next, and
most important, the selectivity of the item is
examined. The lowest acceptable value depends on
the particular statistic used, as there are several
possible techniques. In Part 4 one such technique is
detailed since it is easily used by those with negligible
mathematical interest and, in this case, the
coefficient measuring selectivity needs to be greater
than 0.40. The higher the value, the better the item is
discriminating. Any of these selectivity measures
can be summarised as discriminating between the
best and the weakest candidates. If the great majority
of the best candidates answer an item correctly 
while very few of the weaker ones do, the item is 
highly selective. Clearly, if the success-rate is the 
same in both high and low groups, the item contri- 
butes nothing to our knowledge of the order of 
merit of the candidates and its selectivity is zero. 
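
As a rough illustration of the two checks, the sketch below computes a facility percentage and a simple top-minus-bottom index. The index is a common stand-in and is an assumption here: it is not necessarily the coefficient detailed in Part 4, and the 0.40 limit is borrowed from that coefficient purely for the example. The figures are those quoted for Item 1 in the examples later in this section, and the facility is worked from the top and bottom groups only.

```python
def facility(n_correct, n_candidates):
    """Percentage of candidates answering the item correctly."""
    return 100.0 * n_correct / n_candidates

def upper_lower_index(top_correct, bottom_correct, group_size):
    """Difference between the success rates of the top and bottom groups."""
    return (top_correct - bottom_correct) / group_size

# Item 1: 96 of 159 correct in the top group, 14 of 159 in the bottom group.
item1_facility = facility(96 + 14, 2 * 159)
item1_index = upper_lower_index(96, 14, 159)

# Illustrative acceptance rule: facility within 25-75%, index above 0.40.
acceptable = 25 <= item1_facility <= 75 and item1_index > 0.40
print(round(item1_facility, 1), round(item1_index, 2), acceptable)
```

An item with the same success rate in both groups gives an index of zero, in line with the remark above that its selectivity is zero.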

Finally, it is necessary to examine what use has 
been made of the other answer options. A good item, 
while attracting most good candidates to the correct 
answer, adequately distracts the weaker candidates 
into the various wrong options. Each of these wrong 
options should be selected by some reasonable 
number of the weaker candidates. If any option is 
not used, then its place in the item is pointless. It 
should either be eliminated or a better one formed 
and the item tried again in its amended form. Here 
are two examples: 


ITEM 1: Analysis

Option          A    B    C    D    E  Omit  Total
Top group      12    9   96   31   11     0    159
Bottom group   35   41   14  [?]   42   [?]    159

Option C is correct. This is quite a useful item,
because 96 out of 159 in the top group are correct
and only 14 out of the 159 in the bottom group.
Equally, even in the top group, all the other
distractors are well used and in the bottom group too
there is an even spread of use.


ITEM 2: Analysis

Option          A    B    C    D    E  Omit  Total
Top group      54   31   41   29    3     1    159
Bottom group   48   42    9   30    4    26    159

Option C is correct. This is a poor item because
although the correct answer shows a difference
between 41 and 9 in the groups, option E is hardly
used at all and should be omitted or rewritten.
Option A is too good: not only does it attract far
too many candidates in each group, but it
attracts a greater number in the top group than the
correct answer. This would suggest some ambiguity
in the item so that either A or C may be correct,
based on its interpretation. Certainly this item must
be given more thought; why does Option A attract
so strongly? It may be that Option A is a common
misconception among students, but what is rather 
more likely is that from the advanced point of view 
of the item-writer A is clearly wrong, but at the 
student's own level the subtle distinction between 


options A and C may well be beyond his powers of 
reasoning. As in writing items initially, it is essential 
to ask: what is it reasonable to expect a student to 
know? What quality of inference and deduction can 
be reasonably expected at this level? 
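
The scrutiny of wrong options can be partly mechanised. The sketch below flags an option that is hardly used and an option that draws at least as many top-group candidates as the key, using the Item 2 counts. Two things are assumptions: the 5% usage limit is chosen only for illustration, and the top-group count for option D (29) is inferred from the row total of 159, since the printed figure is unclear.

```python
def review_options(top, bottom, key, group_size, min_share=0.05):
    """Return {option: remark} for wrong options that need attention."""
    remarks = {}
    for option in top:
        if option == key:
            continue
        used = top[option] + bottom[option]
        if used < min_share * 2 * group_size:
            remarks[option] = "hardly used"
        elif top[option] >= top[key]:
            remarks[option] = "draws as many top candidates as the key"
    return remarks

# Item 2 counts; the key is C. Top-group D = 29 is inferred, see above.
top = {"A": 54, "B": 31, "C": 41, "D": 29, "E": 3}
bottom = {"A": 48, "B": 42, "C": 9, "D": 30, "E": 4}
remarks = review_options(top, bottom, key="C", group_size=159)
print(remarks)
```

A flag of this kind only marks the item for a second look; the reason why an option attracts or fails to attract still has to be judged by a subject specialist.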


(b) Decision on each item 

The verdict to be given on each item is only auto- 
matic in the clearest cases of acceptance or rejection. 
Many items need careful consideration, and only 
experience of all the variations that can occur can 
enable reliable verdicts to be delivered. For efficiency, 
the greatest possible number of items have to be 
accepted, and often a slight rearrangement or 
rewriting can convert a poor item into a good one. 
It is, however, necessary to send back any re- 
arranged item into the pool of new items for trial 
again. One of the powerful merits of a final paper of 
test items is that since every item has been trial-tested 
the precise performance of a final paper can be 
predicted within the limits of sampling error. The 
inclusion of untested items, or those whose per- 
formance will alter because of a rewrite, vitiates this 
predictive value and this is much too high a price to 
pay for the chance of using a few extra items. 


(c) Arrangement of selected items 

When all acceptable items in a subject have been 
selected, they are arranged within the sections of the 
test schedule on which they have been based. It can 
at once be seen whether sufficient items are available 
to cover this schedule in the correct proportions. 
For any regular test programme, with the equally 
regular production of final papers, there must be an
ample stock in reserve. The test technician now 
enters the picture. It is his task to select sufficient 
items to create a full paper, making his selection 
purely on the test schedule proportions. The first 
draft of items—at this stage usually on separate 
cards, containing also their statistical record—is 
then submitted to the Chief Examiner whose other 
task will be to write the essay paper, with its mark 
scheme, which supplements the objective test, so 
that they make up between them the full examination
in the subject. His task with the test paper is first of
[...] fit, so that their [...] teaching order. If
the test schedule is [...] in such an order, as it
usually is, then [...] Chief Examiner [...] content of any item
he likes, and select from the stock a suitable replacement
if he wishes. However, he is not allowed to
amend an item for the reasons given above. He can 
accept or reject each item only. Once arranged in 
this order, he is able to construct his written paper 
with the test paper in view so that the two papers 
complement each other and together form a valid 
test on the whole syllabus. 


(d) Answer key 

It is the task of the test technician to compile an 
answer key. In a multiple-choice test with five 
options, each answer will be one of the letters A, B, 
C, D, E. Certain basic rules are essential however. 
The distribution of correct answers among the five 
letters must be quite random. Clearly if, for example, 
a run of several items each had the same letter as the 
correct response: AA A A A etc., a candidate might 
be predisposed to guess ‘A’ as correct after three or 
four earlier ‘A’ responses. Similarly, if a regular 
pattern of correct responses is used repeatedly: 
A C B E D, A C B E D, in groups of five questions,
a quick-witted candidate will soon see this and 
complete difficult items by continuing the pattern. 

Since there are 120 ways of permuting the five 
letters A B C D E, over 600 items can be written 
without once repeating a particular pattern of the 
five letters, so there is certainly no excuse for 
running out of patterns. Equally if there are several 
tests in use, of say, 50 items each, it would be unwise 
to use the identical answer key for each test. 

It is also desirable, as far as possible, to equalise 
the number of times each letter is used as a correct 
response in any test. For 50 items, there should be 
therefore 10 of each of the five letters used in a 
random way, so that if any candidate, for example,
was inclined to pick on one letter automatically, 
whenever he was compelled to guess a response, 
there would be no chance of an accidental bias in 
his score. If, say, there were 20 E's as correct in the
50 items, and few A's, a candidate tending to guess
‘E’ when in doubt would be lucky and likely to have
a better score than another tending to guess ‘A’
when in doubt. This is not an artificial idea:
questioning and basic research, particularly in the U.S.A.,

does show that some—not all—candidates when 
guessing answers, tend to have a favourite option 
and the first ‘A’ and the last ‘E’ are more popular 
than others. Guessing also goes on with written 
papers, and is equally inevitable, although the 




amount is virtually impossible to measure since 
there is very little cross-checking built in to such 
papers. 
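
The rules for an answer key (equal use of each letter, random order, no long runs of one letter) can be met by shuffling a balanced set of letters until no run is too long. The sketch below is one possible procedure and not a prescribed method; the limit of two identical letters in succession and the function names are assumptions made for the example.

```python
import random
from math import factorial

def longest_run(seq):
    """Length of the longest run of one repeated letter in seq."""
    best = run = 1
    for prev, cur in zip(seq, seq[1:]):
        run = run + 1 if cur == prev else 1
        best = max(best, run)
    return best

def make_answer_key(n_items=50, letters="ABCDE", max_run=2, seed=None):
    """Shuffle a balanced set of letters until no run exceeds max_run."""
    rng = random.Random(seed)
    per_letter, extra = divmod(n_items, len(letters))
    # Equal use of every letter, plus a random few if n_items is not a multiple.
    pool = list(letters) * per_letter + rng.sample(letters, extra)
    while True:
        rng.shuffle(pool)
        if longest_run(pool) <= max_run:
            return "".join(pool)

key = make_answer_key(seed=1)
patterns = factorial(5)   # orderings of the five letters A B C D E
print(key, patterns)
```

A different seed gives a different key, so several tests in use together need not share the same one.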

Once the answer-key has been made out, the 
instructions to the candidates need to be written. 
There is a fairly standard form used here with 
suitable minor variations for particular subject 
needs. It is essential to use very simple language for 
these instructions and to show one or more com- 
pleted items as examples. At each stage of a test, if 
the type of item changes, to be followed by a group 
of items of a new style, an example needs to be 
inserted in the text before the new group. Stock 
examples are usually compiled and used for several 
years. The aim throughout is to ensure that no 
candidate, however weak he may be at the subject 
being tested, fails to understand exactly what he 
has to do to complete his answers.

The form of the answer required in a machine- 
scored test can vary from recording the letter, or
number, of the chosen option on a standard answer 
sheet, to completing in pencil a rectangle on a score- 
sheet, against the numbered item. Specimens of two 
typical answer blanks are shown in the appendix. 


9. Validation of a test 

(a) Validity and minimum error score 

A test is said to be ‘valid’ when it measures accurately
the specific ability which it purports to measure. A
criticism levelled at, for example, English essay
papers is that they are not valid, i.e. the marks gained
by a group of candidates bear no relationship to their
true ability in the practical use of written English.
To illustrate the effect of greater or less validity in a
test, consider an example. Two candidates score
respectively 60 and 35 in some academic test. If the
test is valid, the difference of 25 in their marks
measures the real difference in their abilities in this
subject. On the other hand, if the test has a low
validity, then each of the two marks will consist of a
part due to the ability in the subject, and another part
due to one or more totally irrelevant factors. In total,
this second part is called the ‘error score’ since, of
whatever it is composed, it is not part of the ‘true
score’ of the candidate in that subject. Thus the 60
marks may be, say, 45 true score plus 15 error score,
while the 35 marks may be, for that candidate, 30 true
score plus 5 error score. Hence the true difference in
their abilities is 45 − 30 = 15 marks, instead of 25. This
discrepancy decreases as the validity rises simply
because a high validity implies lower ‘error scores’.
Perfect validity is unattainable, but values over 90%
are not uncommon in well-standardised tests. In
practice, particularly in the psychometric field,
where personality traits are less clear-cut than
academic abilities, an individual test may have quite
a low validity. By using a battery of several carefully
selected tests, however, and analysing the contribution
each makes to the required measure, and
indeed other measures, it is possible to weight the
scores on the tests to eliminate or greatly reduce
unwanted effects, and concentrate them upon
desired measures, thus achieving high validity. A
similar process is possible in the academic field and,
for research purposes, is commonly used as a
technique.


(b) ‘Face’ or ‘content validity’ 

In practice, it is assumed that a subject-test for use
in schools, carefully constructed on a detailed test
schedule, is inherently valid for that specific purpose.
This is usually called ‘content validity’, or sometimes
‘face validity’. It would, however, be improper to
apply such a test to a completely different group of
pupils working on another syllabus and assume
automatically that the ‘content validity’ is unchanged.
This is the basic reason why a test designed
and standardised in, say, the United States for
13-year-old children cannot be used unchanged in
West Africa for a similar group of 13-year-olds.
It may still be valid, but it requires research to prove
the point. No academic test is ‘culture free’, as the
phrase goes; that is, no test is free from the influence
of environment and training of those taking it. Some
standard tests claim to be ‘culture-free’ in that
results are apparently not affected by the background
culture of the candidates. In most of the
examples known to the writer the differences in the
backgrounds of the candidates have not been very
great, and the evidence that no part at all of the
resulting score can be attributed to that background
is not always statistically sound. Where
environments or culture-patterns are widely different,
the same test will give different results, due
to the built-in culture element in it. Hence ‘content
validity’ has a limited use and, in any case, is not
subject to measurement directly. It is a judgement
only, based on expert opinion in the subject field
and on the test groups.


(c) Validity by ‘follow-up’ research 

The more crucial form of validity is best called 
‘follow-up’ validity. This is a statistical measure, 
based on the correlation between two sets of scores. 
The first set of scores is obtained from the test 
needing validation. At a later date a second set of 
scores is obtained from a suitably chosen criterion. 
This second set of scores may be those on a stan- 
dardised and validated test. The new test is thus 
validated from an earlier validated test. On the other 
hand, it may be a set of scores from teachers’ tests,
given up to a year later, to those candidates selected 
by the first test. 

To give an example for illustration, suppose 
10,000 candidates take a selection test for secondary 
school entrance. On the results 2000 are selected. 
After a full year in their new schools, marks are 
collected for all 2000 pupils based on term tests, 
class-room tests, and the annual school examina- 
tion. These will of course vary in standards between 
schools, but either they can be converted to common 
standards or, if the scheme is part of a follow-up 
research, the schools can be given new tests or asked 
to adhere to common standards for marking. From 
all these marks an order of merit can be drawn up, 
even if at worst, this has to be a school by school list. 
This order is then compared with the order shown 
in the original selection examination and from the 
correlations obtained the validity of the original test 
can be estimated. It is never an easy task, nor a 
quick one, yet if any test is to be worth while on a 
national scale, such a follow-up validity study must 
be carried out. Theoretical upper limits to the 
validity can be evaluated from the original test data. 
If such limits are low, clearly the test needs improving
first, before time and effort are devoted to finding a
practical validity measure which will be too low to
be useful.
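
The comparison of the two orders of merit described above can be sketched as a rank correlation. This is an illustration only: the text does not say which coefficient was used, so Spearman's rank correlation (computed as a Pearson correlation of the ranks) is an assumption, and the marks for the eight pupils are invented.

```python
from math import sqrt

def average_ranks(scores):
    """Rank scores from the highest (rank 1) down; tied scores share the mean rank."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def pearson(x, y):
    """Product-moment correlation between two equally long lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def rank_correlation(first, second):
    """Correlation between the two orders of merit."""
    return pearson(average_ranks(first), average_ranks(second))

# Hypothetical marks for eight selected pupils.
selection_test = [78, 74, 71, 69, 66, 64, 61, 60]
follow_up_marks = [65, 70, 58, 62, 50, 55, 48, 52]
print(round(rank_correlation(selection_test, follow_up_marks), 2))
```

A value near 1 means the selection test ranked the pupils much as their later school work did; a value near 0 means the two orders have little in common.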

One problem of most follow-up studies is that 
candidates are lost. For example, in the above 
illustration although the ultimate success of the 
2000 selected candidates can be followed, the other 
8000 who have not been selected leave school and 
disappear into various jobs, and save for a few, 
have no further educational progress to measure. 
This attenuation of candidates can be compensated 
for by statistical methods, but this is not a complete 
answer to the problem. 
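
The text does not name the statistical methods it has in mind. One widely used adjustment for this situation is the correction for restriction of range, sketched below as an assumption rather than as the author's own procedure. It estimates what the correlation would have been in the whole entry from the correlation found in the selected group and the two spreads of selection-test scores. The function name and all figures are hypothetical.

```python
from math import sqrt

def correct_for_range_restriction(r_selected, sd_all, sd_selected):
    """Estimate the correlation in the full group of candidates.

    r_selected  -- correlation observed among the selected candidates only
    sd_all      -- standard deviation of selection-test scores, all candidates
    sd_selected -- standard deviation of selection-test scores, selected group
    """
    k = sd_all / sd_selected
    return r_selected * k / sqrt(1 - r_selected ** 2 + (r_selected * k) ** 2)

# Hypothetical figures: the selected 2000 have half the spread of the 10,000.
print(round(correct_for_range_restriction(0.40, sd_all=12.0, sd_selected=6.0), 2))
```

The adjustment rests on assumptions of its own (for instance, that the relationship is linear over the whole range), which is one reason it cannot be a complete answer.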

Again, where practical situations arise enabling
such candidates to be followed up, the evidence
suggests that if good validity values are obtained
from the selected group, such a selection has been
accurate within the bounds of test accuracy. In
Ghana, for example, for some years selection for
secondary schools was made initially from the
second year of the four-year Middle School course.
Those not selected at that time were able to try
again a year later, and a further attempt was still
possible in the fourth year. It was found that the
great majority of selected children came from the
second year level, helped of course because secondary
schools naturally prefer to recruit younger
children. Nevertheless, allowing for age differences,
still relatively few were selected from the third year
at their second attempt, and hardly any at all from
the fourth year at their third attempt. (In fact, many
of the selections from the fourth year were, by design,
for trade and technical schools.) This showed that
the selection procedure was accurate enough to be
used once for each candidate, the relatively small
selection from the repeat years representing the
degree of error in the original selection and possibly
a few late developers. For the great majority of
unselected candidates, further attempts merely
confirmed their lowly scores. In this case, the
evidence of repeat scores gave an estimate of the
validity of the general test procedure. Tests of
different, but parallel, form were used each year so
it was impossible to assign an individual validity
coefficient to a particular test. However, this is a
perfectly legitimate procedure for obtaining an
annually recurring set of ‘sequence validities’. If, of
course, the style of the testing is changed to a totally
different form, the sequence is broken and a new
series needs to be compiled.


(d) Validity and reliability 


A final point: reliability and validity are two fundamental
statistical measures of any test. It is comparatively
easy to obtain high reliability so that
virtually the same score is repeatable for each
candidate if the test is repeated later. Yet a highly
reliable test is useless if there is low validity. For
what is the point of making an accurate measurement,
if it is not possible to say with any confidence
exactly what is being measured? On the other hand,
a test with a low reliability is very unlikely to have a
high validity, hence the first aim must be to achieve
a high reliability. Some tests have only moderate
reliability and equally moderate validity, particularly
in tests of mental aptitude or personality characteristics.
Combined with others, however, and
suitably weighted, they can be used. Where reliance
is placed on the results of a single test, it is essential
that reliability and validity are each at least of the
order of 0.90, or 90% of the theoretical maximum.
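
The link between the two measures can be made concrete with a result from classical test theory that is not stated in the text: a validity coefficient cannot exceed the square root of the product of the reliabilities of the test and of the criterion. The sketch below is offered on that assumption; the function name and the figures are illustrative only.

```python
from math import sqrt

def validity_ceiling(test_reliability, criterion_reliability=1.0):
    """Largest validity coefficient that the two reliabilities allow."""
    return sqrt(test_reliability * criterion_reliability)

# A perfectly reliable criterion is assumed unless a second figure is given.
for reliability in (0.90, 0.70, 0.50):
    print(reliability, round(validity_ceiling(reliability), 2))
```

On this reckoning a test with reliability 0.50 could not reach a validity of 0.90 against any criterion, which is why high reliability has to come first.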


(e) The core problem in validity 

The above arguments are included so that the reader
has some idea of the meaning of the term ‘validity’.
The actual problem of determining validity accurately
is a difficult one, since it implies some criterion
against which the new test is to be ‘validated’. Since
in many situations the only criterion available is
either an earlier test score or, even worse, a written-style
examination result, it may well be that the new
test is more accurate in its measure than the earlier
ones, but there is no way of proving this.

There are advanced statistical techniques available,
usually involving several suitably weighted criteria,
but not only are they beyond the scope of this present
discussion, they too imply that reliance is still
placed on earlier validation studies.

A more limited approach has had some value,
since one of the purposes of introducing objective
tests as part of an examination system, retaining in
each subject tested written papers as well, was to
compare the results, subject by subject, between the
two halves of an examination. If a reasonably high
correlation could be obtained, it would be possible to
claim that both parts of the paper were helping to
measure similar attainment. A total score obtained
from the separate scores, weighted or standardised
as desired, would be a better measure than either
score separately. Hence those parts of a subject
amenable to objective testing would have the
advantages of such tests, while the remainder,
particularly in the field of connected ideas and
logical presentation of material in written form,
would remain as a conventional paper.
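
A total score built from standardised part-scores, as described above, might be computed along the following lines. This is a minimal sketch: equal weights, standard scores based on the group mean and standard deviation, and the marks themselves are all assumptions made for illustration.

```python
from statistics import mean, pstdev

def standardise(scores):
    """Convert raw marks to standard scores (mean 0, standard deviation 1)."""
    centre, spread = mean(scores), pstdev(scores)
    if spread == 0:
        raise ValueError("all marks are equal, so they cannot be standardised")
    return [(score - centre) / spread for score in scores]

def composite(objective, written, w_objective=0.5, w_written=0.5):
    """Weighted total of the standardised objective and written marks."""
    z_objective = standardise(objective)
    z_written = standardise(written)
    return [w_objective * a + w_written * b
            for a, b in zip(z_objective, z_written)]

# Hypothetical marks for four candidates on the two halves of a paper.
objective_marks = [40, 30, 20, 10]
written_marks = [70, 50, 60, 20]
print([round(total, 2) for total in composite(objective_marks, written_marks)])
```

Standardising first prevents the half with the wider spread of marks from dominating the total merely because of its scale.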

As experience builds up, year by year, in this style
of examination, particularly as the few really
experienced examiners can be concentrated on the
marking of the written material, standards are
established for both kinds of paper; the inherent
stability of the pre-tested objective element acting
eventually as a standard by which the relative
difficulty of the written paper from year to year,
always more likely to vary, can be accurately
assessed.


Part 3. Specimen Items Illustrated and Discussed 


The purpose of the following questions is to illustrate
the types of multiple-choice items that can be
used, and in some cases, a short discussion shows
up a weakness and suggests possible ways of improving
the item. More detailed studies are included
in the books in this series dealing with individual
subjects.


Examples from mathematics are

Example 1: A car travels 9¾ miles in 21 min. 40 sec.
The average speed in m.p.h. is:

A. 28 B. 27 6/7 C. 27 D. 25% E. 133$

A poor item because some candidates always tend
to expect an exact answer, although answer B is
obtained if the 40 sec. is neglected. E is unreasonable
as a car speed and there is no simple way of arriving
at either A or D results. What is being tested here?
Surely the only real target is the correct use of the
idea that
$\text{average speed} = \dfrac{\text{distance travelled}}{\text{time elapsed}}$.
Why not therefore simplify the data?
Distance travelled = 9 miles, time taken 24 mins.
The answers could then be, all reasonably possible by
incorrect cancelling, all reasonable in value:

A. 114 B. 212 C. 22½ D. 45 E. None of these

Here the simple answer is not correct, C being a
fractional value like A and B. The use of option E
helps when a weak student fails to arrive at any
of the four stated results. Occasionally, however, it
should be the correct answer.
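
The arithmetic behind both versions of Example 1 can be checked with exact fractions. The sketch assumes the distance in the original item is 9¾ miles, the reading under which option C comes out exactly; the function name is hypothetical.

```python
from fractions import Fraction

def average_speed_mph(miles, minutes, seconds=0):
    """Average speed = distance travelled / time elapsed, in miles per hour."""
    hours = Fraction(minutes, 60) + Fraction(seconds, 3600)
    return Fraction(miles) / hours

original = average_speed_mph(Fraction(39, 4), 21, 40)  # 9 3/4 miles, 21 min 40 sec
no_seconds = average_speed_mph(Fraction(39, 4), 21)    # the 40 sec neglected
simplified = average_speed_mph(9, 24)                  # the simplified data

print(original)     # 27
print(no_seconds)   # 195/7, i.e. 27 6/7
print(simplified)   # 45/2, i.e. 22 1/2
```

The simplified data test the same idea with one short cancellation, and the correct result is still not a whole number.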
i 65 
Example 2: In the equation Z s=3x— p p 
value of x lies between: 


A. −1 and +1
B. +1 and +3
C. +3 and +5
D. +5 and +8
E. +8 and +11


All ranges are possible if one or other of the common 
errors in solving simple equations is made. The use 
of the above range-technique prevents the sub- 
stitution in turn of the given answers until one is 
found to fit. Substitution in an algebraic formula is a 
different study and is tested specifically by suitable
items.

Example 3: The diagram shows three intersecting
circles. Each circle represents a group of people.
Circle I represents all people over 6 ft. in height.
Circle II represents all football players. Circle III
represents people possessing a bicycle. What
statement below correctly defines the area marked
E on the diagram?

[Diagram: three intersecting circles, I, II and III]

A. Persons over 6 ft. in height who possess
bicycles. They do not play football.

B. Persons under 6 ft. in height who play football,
but do not possess bicycles.

C. Football players, without bicycles, but over
6 ft. in height.

D. Persons not over 6 ft. in height who play
football and possess a bicycle.

E. It is impossible to say anything about this
group.

Example 4: In the above diagram, which area
could represent persons over 6 ft. in height who
neither play football, nor own bicycles?

A. B. C. D. E.


Such items can usually be answered by the more
intelligent candidates quite easily, even if they have
had no formal experience of modern topics in
Mathematics. Similar items using the notation of set
theory and the concept of operators have been tested
at G.C.E. Ordinary level on various groups. They
are not easy for the candidates not trained in such
topics, but act as good intelligence items. It is comparatively
easy to write mathematical items, but an
endeavour should always be made to analyse exactly
what process of reasoning is expected from the
candidate. Especially at elementary levels it is better
to set a lot of very simple items, each testing one idea
at a time. This enables a class teacher to locate
very accurately the exact points of difficulty or
failure to comprehend.
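
The reasoning that Examples 3 and 4 call for can be written out with ordinary set operations. The sketch below is an illustration only; the names of the people and the helper function are invented, and the lettering of the areas on the original diagram is not reproduced.

```python
def region(over_6ft, footballers, cyclists, *, tall, plays, owns):
    """People whose membership of the three circles matches the three flags."""
    everyone = over_6ft | footballers | cyclists

    def pick(group, wanted):
        # Inside the circle if wanted, otherwise everything outside it.
        return group if wanted else everyone - group

    return pick(over_6ft, tall) & pick(footballers, plays) & pick(cyclists, owns)

# Hypothetical membership of the three circles.
circle_1 = {"Ama", "Kofi", "Ben", "Dele"}      # over 6 ft. in height
circle_2 = {"Kofi", "Ben", "Efua", "Femi"}     # football players
circle_3 = {"Ben", "Dele", "Femi", "Gyasi"}    # possess a bicycle

# Example 4: over 6 ft., neither play football nor own bicycles.
print(region(circle_1, circle_2, circle_3, tall=True, plays=False, owns=False))
# Option D of Example 3: not over 6 ft., play football and possess a bicycle.
print(region(circle_1, circle_2, circle_3, tall=False, plays=True, owns=True))
```

Each area of the diagram corresponds to exactly one combination of the three flags, which is the single idea such an item tests.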


Example 5:
Here is a sketch of a graph with some points
indicated.

QUESTION 1
The equation of this curve is most likely to be:
A. y-ix44
B. y-Qx-1)9—4)
C. y=4x?— 4x424
E. y— Gia)

QUESTION 2
I. A minimum point; II. an inflexion point; III. two
turning points.
This curve shows:
A. I only; B. II only; C. III only; D. I and II,
but not III; E. None of these things




QUESTION 3

If x is negative, the values of y are:
A. negative; B. positive; C. may be both
positive or negative; D. always infinite; E. not
exactly determined

QUESTION 4

Negative values of y only occur if
I. x>4; II. x<4; III. x>0; IV. x»4;
V. —43«x«4

Which of the above statements correctly gives all
negative values of y?
A. I only; B. II only; C. I and II together;
D. I and IV together; E. V only

QUESTION 5

The co-ordinates of the point P are:
A. (0, 4); B. (4, 0); C. (0, 2); D. (0, 24); E. (24, 0)


Examples from Biology are 

Example 6: During photosynthesis a plant:
A. loses starch; B. manufactures starch; C. becomes
long and thin; D. loses its green colour
This item failed because options C and D were not
selected, being clearly wrong. A and B, as
opposites, were almost equally chosen. This proved
that the word ‘photosynthesis’ triggered off the
word ‘starch’, but most candidates were thereafter
obliged to guess between A and B.

Example 7: Loss of water from a leaf system can be
demonstrated with the aid of:
A. cobalt chloride paper; B. litmus paper; C. copper
chloride paper; D. soda lime


A potentially good question spoiled by its framing.
Clearly most pupils had heard of the experiment,
but one wonders again how many understood the
principles. Choices B and D were virtually ignored,
60% chose A correctly, 36% chose C. Clearly
‘chloride paper’ acted as a trigger, the distinction
between ‘cobalt’ and ‘copper’ being a vague one.
It is sometimes possible to turn a question round to
improve it, e.g. ‘In which one of the following
experiments is cobalt chloride paper used to
demonstrate the effect...?’
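
The kind of analysis applied to Examples 6 and 7, counting how often each option was chosen, can be sketched as follows. The 60% and 36% figures for Example 7 come from the discussion above; the 2% each for B and D, the 5% threshold and the function names are assumptions made for illustration.

```python
from collections import Counter

def option_percentages(responses, options="ABCD"):
    """Percentage of candidates choosing each option."""
    counts = Counter(responses)
    total = len(responses)
    return {option: 100 * counts[option] / total for option in options}

def weak_distractors(percentages, key, threshold=5.0):
    """Wrong options chosen by fewer than `threshold` per cent of candidates."""
    return [option for option, share in percentages.items()
            if option != key and share < threshold]

# Example 7 with 100 hypothetical candidates; the key is A.
responses = "A" * 60 + "C" * 36 + "B" * 2 + "D" * 2
shares = option_percentages(responses)
print(shares)
print(weak_distractors(shares, key="A"))
```

Options that attract almost nobody do no work in the item, so in effect Example 7 was a two-way choice between A and C.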

Example 8: In cases of myopia, spectacles are
needed which contain:

A. convergent lenses; B. coloured lenses to
reduce glare; C. lenses to block ultra-violet
light; D. divergent lenses

A better type, firstly since ‘myopia’ had to be
correctly identified with ‘short sight’, from which
the idea of divergence to ‘lengthen’ the sight
had to be deduced. Option A was clearly a direct
opposite, but B and C also distracted weaker pupils
who had to guess the meaning of ‘myopia’.

Example 9: Here is an example of a double item,
usually marked as two separate results, but
economising in ideas and space.

Mucor is:

A. a filamentous alga; B. a green plant;
C. a fungus; D. a moss

and may be found:

A. in muddy streams; B. on soil in damp places;
C. on dead trees in forest areas; D. on old moist
food


In Geography a wide variety of question can be used.
Examples of items that can be set on a suitably marked
map extract are shown below. The actual map
cannot be reproduced here, but any sheet extract
can be selected to include the features mentioned
in the items.

Example 10: Examine the watershed of the Elemo
basin. There are six numbers, 1 to 6, clearly
marked on the map. Which numbers lie on this
watershed?

A. 1 and 2 only; B. 2, 3 and 5 only; C. 5 and 6
only; D. 1, 3, 5 and 6; E. 1, 3, 4, 5 and 6

If, in an item like this, one of the indicated numbers
is likely to be most easily selected, either omit this
number or include it in all, or all but one, of the
answer combinations. Its absence can mark a
wrong answer very easily and vitiate the true
selectivity of the item.

Example 11: There are four features marked W,
X, Y, Z on the map. Which one group below includes
all of the four features?

A. steep slope, spur, conical hill, gentle slope
B. scarp slope, col, steep slope, wide valley
C. spur, concave slope, gentle slope, col
D. convex slope, flat topped hill, precipitous
slope, col
E. steep slope, artificially drained plain, spur, col

This type of item allows many different features to be
examined in one question. An alternative use of the
four features is by matching:

W ........ A. spur
X ........ B. concave slope
Y ........ C. gentle slope
Z ........ D. col
           E. steep slope

The defect of the matching technique, for large
scale use, is that the choices available are reduced to
five, as shown, whereas in the earlier form of
Example 11 many more than five different features
can be named, although of course, only a specific
four are correct. On the other hand, Example 11 is
one item; the matching method converts it to four
items.

Another technique is to draw a map of a mythical 
place, or perhaps an island, showing latitude and 
longitude lines and marking one or two distinct 
points by letters. It is then possible to set a series of 
items, including finding locations using their 
latitude and longitude; asking for bearings from 
one point to another: time differences based on 
longitude; features such as the longest day and the 
height of the mid-day sun. If an actual place is given, 
memory can often play too big a part, but if the 
results have to be deduced from a mythical place, 
the comprehension of the relationships involved is 
the major field of the testing. These items are neither 
easy to construct, nor do pupils find them easy to 
solve. Only the best fully succeed. Dependence on 
memory alone is rather ruthlessly exposed and, 
indeed, whenever comprehension of principles and
general ideas is tested, the failure rate rises
sharply. A group of items can be made to depend on
a single set of terms. For example: 




Here are five terms: 


A. Radiation; B. Convection; C. Condensation; 
D. Evaporation; E. Lapse rate 


For each of the statements below, select the term most
closely related to it:
Example 12: The change from water
vapour to cloud formation. (C)
Example 13: The upward movement of air,
heated by contact with the earth’s surface. (B)
Example 14: The formation of dew on a
cold surface. (C)
The list of questions can extend as far as ingenuity
allows, for the same term in the list can be the
correct answer for more than one question.
A world map can be provided, suitably shaded, 
lettered or numbered, and items built round it. 
For example, given a suitably marked map: 
Example 15: Which letter is marked in the 
countries named below: 


1. Japan 2. Tunisia 3. Ecuador
4. Argentina 5. Egypt 6. Morocco ... etc.


This list of countries should exceed the number 
actually marked. 


English Language tests 


The number of examples will not be great in this
subject, partly because there is at present an active
review of the teaching techniques needed and partly
because in the Grieve Report on English Language
examining (16) there is quite a large collection of
specimen items.

Comprehension questions are very effective. The
principle here is to provide a carefully selected
... need to be ... and then ask ...


idioms, words or phrases, as used in this passage. 


hin the genera] 
t they are not 
entity with the 
of a teacher can 
tems can follow 


(1) ‘The writer says that...’ followed by four or five
separately-lettered sentences on the same theme,
only one of which accurately summarises the
writer’s statements on this theme.

(2) Vocabulary items, selecting a word and offering
four or five other words, or perhaps phrases, for
the choice of the best synonym: similarly
antonyms can be offered. This item in isolation
is rather feeble and it is essential to link such
items with the words in context, where the exact
meaning or usage of the word is quite unambiguous.

(3) Punctuation and spelling can be introduced, but
with considerable caution. One of the better
ways, but admittedly difficult to construct, is to
offer several identically worded sentences,
differently punctuated, and to ask which of them
has the same meaning as a paraphrased given
sentence. Spelling errors are less likely to be
useful, but can be used in class-room tests to
stress common errors. Elaborate punctuation is
not essential at G.C.E. ‘O’ level and the more
esoteric forms found in some literature are not
valid areas for testing at this level.




Here is a simple composite set of examples, involving
several ideas at once.

Example 16: In the passage below, several places
are indicated by numbers in boxes, where underlining
or omission has occurred. Select from the
choices offered the correct word, or punctuation, to
be inserted in the spaces, so that the passage makes
sense.

A sentence is made up of [1] several short [2] of
varying importance; therefore the sentence [3]
built in an orderly style, phrase [4]


[1] ‘is made up of’ can be replaced by:
A. is built; B. consists; C. composes;
D. is composed of (64%)
[2] A. phrases; B. words; C. lines;
D. clauses (26%)
[3] A. will; B. must not; C. should;
D. cannot (16%)
[4] A. and clauses; B. by phrase; C. after
clause; D. with clauses (24%)

The percentages show the percentage of correct
answers from a trial group of ‘O’ level students.
Another simple test of comprehension at
lower academic levels is the task of rearranging
jumbled sentences into a correct order to tell a
simple story.



Example 17: 


A. One was very large, the other smaller 

B. The visiting team was in it 

C. The small one was a green bus 

D. After a short wait we saw two vehicles arrive 
E. The other vehicle carried the spectators 


"Read the above sentences, decide the correct order 
in which they should be placed to tell a story, and 
then write below the letters of the sentences in the 
Position you would place them." 


First sentence (D). Third sentence (C). 
Last sentence (E). 


(The only possible, logical order is D-A-C-B-E, but
if this order as it stands is asked for, marking is
difficult if, say, there is one or more in the wrong
order. The above method asks three specific questions
and each is either right or wrong. On the whole,
these items are rather easy to do.)
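
The marking scheme just described, three right-or-wrong questions instead of one whole ordering, can be sketched as follows. The key D-A-C-B-E and the three positions asked for are taken from Example 17; the function and dictionary names are illustrative.

```python
KEY = "DACBE"                                    # the only logical order
QUESTIONS = {"first": 0, "third": 2, "last": 4}  # positions asked for

def score_positions(answers, key=KEY, questions=QUESTIONS):
    """One mark for each asked position where the candidate's letter is right."""
    return sum(1 for name, index in questions.items()
               if answers.get(name) == key[index])

print(score_positions({"first": "D", "third": "C", "last": "E"}))  # all three right
print(score_positions({"first": "D", "third": "B", "last": "E"}))  # third one wrong
```

Each of the three responses is marked independently, so no decision is needed about how much credit a partly correct ordering deserves.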
Errors of spelling induced by a misunderstanding
based on poor pronunciation are also effective at
lower levels, e.g. at the secondary entrance selection
stage. They also make excellent items in classroom
tests for the elimination of common errors. Examples
are: ‘lives-leaves’, ‘work-walk’, ‘formally-formerly’.
Here is a full example.

Example 18: Read the four sentences below.
Indicate which one, if any, contains a spelling error,
or a word incorrectly used. If all four sentences are
correct, write the letter ‘E’ on your answer sheet.

A. Heights make me very dizzy
B. His advice was always worth serious thought
C. He worked every day to his office
D. There are many new designs of furniture these
days

C employs the ‘worked-walked’ error, but ‘to’ could
equally be changed to ‘in’; B includes three words
spelt wrongly fairly often; A: ‘heights’ is often
confused with ‘high’ from spelling viewpoints;
D: ‘designs’, and the general plurality offers the
common trap of ‘furnitures’, even seen not infrequently
in newspaper advertisements.

A final form, often rather more difficult, is to omit
a key word (a preposition, article or adverb) and
require the selection of one from a list given, which
can be inserted to allow the given sentence to make
sense.


Example 19: For your assistance, we should never 
have finished before sunset. 


A. unless B. if C. because D. however E. but 


(‘But’ is needed initially, yet ‘For’ has to begin with 
a capital letter to avoid a give-away for position.) 
This type of item can be made very difficult and is 
useful for ‘O’ level selection of better candidates. 

Example 20: My elder brother is not tall as I am. 


A. tall B. as C. that D. much E. too 


The other alternatives are typical of common speech 
errors, i.e. ‘that’ or ‘tall’ to terminate the sentence, 
or ‘much tall’, or ‘too tall’. The inclusion of ‘so’ as 
an alternative as well as ‘as’, tends to confuse, since 
some would say that either could be correct. 

Thus many items in language testing need expert 
knowledge of local common errors of speech and 
teachers can, and indeed should, compile a long list 
of such errors. Artificial distinctions, particularly 
where there is still disagreement among experts, 
should certainly not be tested. The arguments, for 
example, about ‘will’ and ‘should’ are certainly too 
arid for live testing at ‘O’ level. Test situations in 
English Language, particularly where grammatical 
structure is involved, must firstly be in context and, 
secondly, should be focussed on real difficulties and 
errors in the community. There is no point in trying 
to test an error form which is not encountered in 
practice. 


Foreign Language testing 


The commonest modern foreign language in schools 
is French, hence examples will be confined to this 
language. Yet the general pattern of items in any 
language can follow the forms used in English 
Language testing. 

There is a useful technique for a teacher in testing 
aural response objectively. A tape-recorder helps a 
great deal, but is not essential. A set of answer 
options is handed round the class and the teacher 
then reads, or plays, a short simple passage in the 
language—perhaps a conversation between two 
people, or merely a descriptive passage from a book 
in class use. Twice the passage is repeated while the 
class listens, and it is hoped, comprehends. The 
instructions for the replies needed are also spoken 
in the foreign language. Then the teacher asks 
several questions, e.g. No. 1...? The class sees five 
possible answers (in the language) offered for
Question 1, and they mark their chosen answer.
Similarly, further questions are asked and answered. 
This implies aural comprehension of both the 
passage and the questions and visual comprehension 
of the printed (or cyclostyled) answers. This is a 
useful test procedure to examine aural standards: 
reading and translation skills are readily tested with 
printed questions as usual. 

By gradually increasing the reading speed and the 
rapidity of asking questions the teacher can en- 
deavour to eliminate the mental translation to and 
from the language which, because all students begin 
by doing it, inhibits their own progress in rapid 
dialogue. 

The correct use of parts of speech or verb tenses 
can readily be tested in sentence context, as below: 


LIST: A. soit B. était C. fut D. est 


Select the correct part of the verb ‘être’ from the
above list to make each sentence correct. If no part
of the verb correctly fills the space, record ‘E’ as
your answer.

Example 21: Nous verrons la directrice, si
elle...au lycée.

Example 22: Le jour...un espace de vingt-quatre
heures.

Example 23: Elle...très occupée parce qu'elle
partait le lendemain.


Instructions can also be given in the language for 
better tests of reading skill: 

‘Choisissez un des pronoms relatifs et remplissez
l'espace sur votre carte de la lettre du pronom de
votre choix.’

A. où B. que C. qui D. dont E. duquel 

Example 24: Je vais là...il demeure. 

Example 25: Cet homme...je connais la force, 
n'est pas méchant. 

Example 26: L'ami...j'ai est fidèle. 


Science tests 


Two aspects need to be considered: both the usual 
form of testing and theoretical knowledge, both 
factual and more complex ideas; and also, the 
construction of items which can only be answered 
successfully if the candidate has undergone a 
practical course in a laboratory. Of course, the 
practical test in a live laboratory situation will be 
continued, but it is useful training to analyse what a 
student is expected to learn specifically from a 
practical course, which he cannot learn by text-book 
study of descriptions of experiments, or by a course 


30 


of lectures or lessons. It is perhaps unfortunate, 
however necessary because of crowded conditions, 
that one of the traditional techniques in school 
laboratory courses is to give a pupil a printed card 
detailing his apparatus, his method of experiment 
and the way to obtain his results and present them. 
Because he faithfully follows the carded instructions, 
he often fails to think for himself about the experi- 
ment and learns little more than he could obtain 
from a good descriptive book on class experiments. 
His 'writeup' of the experiment is often very 
mechanically done and is not really a part of his 
absorbed understanding. The whole of an experi- 
ment can be tested by a set of objective items, 
beginning with apparatus selection and assembly, 
then recording data, then tabulating or graphing the 
data, and finally the deductions to be made. To 
answer such a set without recourse to notes or books 
implies, if the items are searchingly set, a compre- 
hension of the underlying theory and the reasons for 
the selection of the data to verify a stated law. Trial 
results have shown that where a candidate has
inadequate practical knowledge, however good his
theoretical knowledge, he is incapable of answering
the majority of items based on purely practical
work. In setting such items, teachers need to study
every aspect of an experiment to find out the essential
skills and where accuracy is most important, and
analyse the effect on the results if one or other
experimental precaution is not taken.

Here are two simple examples from Physics:

Example 27: (A straight-line graph is illustrated,
plotting $T^2$ against $L$.)
An experiment is performed to determine the value
of ‘g’, using a simple pendulum. The formula

$$T = 2\pi\sqrt{\frac{L}{g}}$$

is used. T is in seconds and L in centimetres.
Taking $\pi$ as 3.14, determine from the graph the
value of ‘g’ found by the experiment to the
nearest whole number.


A. 980 B. 981 C. 32 D. 31 E. 983


(32, 31 are impossible with the data, unless the
candidate just guesses C by thinking in f.p.s. units, or
D as an experimental value.) Similarly, A and B
are often used in calculations, but neither could be
correct. Only by correctly interpreting the slope of
the graph as $4\pi^2/g$, reading the slope accurately and
evaluating ‘g’ from the data, can the correct result
be obtained.
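
The reasoning demanded by Example 27 can be sketched in a few lines. Squaring the pendulum formula gives T² = (4π²/g)L, so the slope of the graph of T² against L equals 4π²/g. The function name and the graph reading used below are illustrative assumptions only; the book's actual graph data are not reproduced here.

```python
def g_from_slope(slope_s2_per_cm: float, pi: float = 3.14) -> float:
    """Return g in cm/s^2 from the slope of a graph of T^2 against L.

    T = 2*pi*sqrt(L/g)  =>  T^2 = (4*pi^2/g) * L  =>  slope = 4*pi^2/g.
    """
    if slope_s2_per_cm <= 0:
        raise ValueError("slope must be positive")
    return 4 * pi ** 2 / slope_s2_per_cm


# Hypothetical reading (not the book's data): T^2 rises by 4.0 s^2
# while L rises by 100 cm, so the slope is 0.04 s^2 per cm.
slope = 4.0 / 100.0
print(round(g_from_slope(slope)))
```

A candidate who inverts the slope, or who forgets to square the period, arrives at values far from any plausible option, which is exactly what the item is designed to expose.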

Example 28: (A Centigrade thermometer is
shown alongside a graduated metre rule. Calibration
marks corresponding to, say, 0°C and 100°C are
shown on the rule. The mercury level in the thermometer
is clearly shown against a reading on the rule.
The diagram is explained.)
The question is simply: ‘What is the temperature
recorded by the thermometer?’ The calculation is a
simple proportion, correctly carried out in relation
to the rule marks. If this simple experiment in
calibrating a thermometer has never been tried,
using melting ice and boiling water, only the better
candidates will be likely to deduce the principle and
method of calculation. The alternative incorrect
options are those which would be obtained if the
rule-marks are incorrectly applied in the proportion
statement.
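
The proportion in Example 28 might be sketched as follows. The rule readings are invented for illustration, and the function name is an assumption, not part of the original item.

```python
def temperature_from_rule(reading_cm: float,
                          ice_mark_cm: float,
                          steam_mark_cm: float) -> float:
    """Temperature in degrees C by simple proportion between the
    0 degree (melting ice) and 100 degree (boiling water) marks."""
    span = steam_mark_cm - ice_mark_cm
    if span == 0:
        raise ValueError("the two calibration marks must differ")
    return 100.0 * (reading_cm - ice_mark_cm) / span


# Hypothetical diagram: 0 degrees C at 12.0 cm, 100 degrees C at 37.0 cm,
# mercury level opposite 22.0 cm on the rule.
correct = temperature_from_rule(22.0, 12.0, 37.0)

# A typical distractor: the candidate forgets to subtract the ice mark.
distractor = 100.0 * 22.0 / 37.0
print(correct, round(distractor, 1))
```

Each incorrect option can be generated in this way from one specific misuse of the rule marks, so that a wrong choice also tells the teacher which step was misunderstood.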

The interpretation of a graph resulting from an 
experiment is an excellent way of testing under- 
standing of the physical principles underlying the 
experiment. Usually several items can be built 
around each graph. Consider a graph showing the 
temperature of water under pressure and subjected 
to a constant supply of heat until after it has reached 
its boiling point against time. The graph starts as a 
straight line with positive slope which eventually 
levels off into a horizontal line. Items can be asked 
on: 


(i) The rise in temperature per minute.
(ii) The amount of heat in calories being supplied 
per minute, the quantity of water being given. 

(iii) The significance of the horizontal section (the 
water is boiling and therefore not rising in 
temperature any more). 

(iv) The temperature at which boiling occurs (many 
candidates will automatically and thoughtlessly 
choose 100°C and reject the point shown on the 
graph which is higher because of the pressure). 


A rather complex, but useful, item in Science is to
illustrate apparatus, or electrical circuits, connected
in various ways to perform an experiment. Minor,
but vital, errors are introduced into all but one
picture, which has to be selected. Another variety of
practical item leads a candidate step-by-step through
the build-up of apparatus to determine some stated
experimental result. The result is said to be poor
and the nature of the error given. The candidate must
decide which step is either wrong, or insufficiently
detailed, so leading to a faulty experiment.

A simple calculation can often be linked to an 
experiment. For example, the determination of the 
specific gravity of a small solid—lead shot, sand, 
mercury, etc.—by using specific gravity bottles. 
Diagrams show the bottle empty, with solid only, 
solid and water, and water only, with the weights 
underneath. From this experimental situation the 
specific gravity can be calculated, but no automatic 
formula is provided. It could, of course, be memor- 
ised, but this is unlikely—there are so many possible 
situations that could appear—hence the result needs 
to be obtained by reasoning from the first principles 
underlying the experiment. 
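
The first-principles reasoning for the specific gravity bottle can be sketched as below. The four weights are invented for illustration and the function name is an assumption; the essential step is that the water displaced by the solid is the water that fills the bottle alone, less the water that fits in alongside the solid.

```python
def specific_gravity(empty: float, with_solid: float,
                     with_solid_and_water: float,
                     with_water: float) -> float:
    """Specific gravity of a solid from four weighings of a bottle (grams)."""
    solid = with_solid - empty
    water_filling_bottle = with_water - empty
    water_alongside_solid = with_solid_and_water - with_solid
    displaced_water = water_filling_bottle - water_alongside_solid
    if displaced_water <= 0:
        raise ValueError("weighings are inconsistent")
    return solid / displaced_water


# Hypothetical weighings for lead shot (grams):
# empty 25.0, with solid 82.0, solid and water 127.0, water only 75.0
print(round(specific_gravity(25.0, 82.0, 127.0, 75.0), 1))
```

Because the candidate is given weighings rather than a formula, an item of this kind can vary which weighings are shown and still be answered by the same reasoning.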

Another, more involved, item illustrates two
similar experiments side by side, for example
experiments to determine latent heats of, say, ice and
steam respectively. In each experiment, list say three
points where standard precautions to preserve
accuracy are omitted, e.g. the failure to dry the ice
in the first or the inefficient lagging of the calorimeter
in the second experiment. The requirement is
to select which flaw in the first experiment will
produce a similar error to a similarly selected flaw
in the second experiment. Only one correct
pairing is possible among the nine pairings available
from three flaws in each experiment. Clearly the
teacher setting the item needs to be really expert in 
his appreciation of detail, yet such items are highly 
selective of well-trained candidates. The variety and 
flexibility of this form of item is so wide that coach- 
ing for it is virtually impossible, yet as such items 
become part of the stock of science teachers, their 
explanations to a class of the implications behind 
such flaws in experimental technique can surely only 
lead to beneficial results in science teaching. 

Similar items throughout can be constructed in 
Chemistry too, with the added variety of problems 
involved in balancing chemical equations, valency 
concepts, and chemical reactions and processes. It 
is necessary in setting Chemistry items however, to 
take care that those which at first thought seem to 
require deductive reasoning are not in fact straight- 
forward memory items. In Chemistry, far more than 
in Physics, students tend to memorise chemical 
formulae, and have also been taught quite mechanical
methods of calculating, for example, the
number of atoms involved in a reaction expressed as
a chemical equation.


Part 4. Elementary Item Analysis for Use in Class-room Tests 


1. The average score
The first simple measure is to obtain the average
score for the class. This is already calculated in many
schools for the existing examinations.
All of the marks for the class are added together
and the total divided by the number of students for
the examination, i.e.

$$\text{Average mark} = \frac{\text{Total of all marks}}{\text{No. of marks added}} = \frac{\sum X}{N} = \bar{X}$$

[The Greek $\sum$ is used in mathematical notation to
indicate summation.]


Example: 9 boys take a test of 50 items, each item
scoring 1 mark. Their marks are:


11, 15, 24, 28, 28, 29, 32, 41, 47


Then

$$\sum X = 11+15+24+28+28+29+32+41+47 = 255, \qquad N = 9$$

$$\text{Average} = \bar{X} = \frac{255}{9} = 28.33$$

(say 28, for class-room use).
This is an average of 28.33 out of a total of 50 items, i.e.

$$\frac{28.33}{50} \times 100\% = 56.66\%, \text{ say } 57\% \text{ of total marks.}$$


2. Evaluating each item separately 


For each item, two measures are needed to assess its
value in the test as a whole. These are (a) Facility, F,
(b) Discrimination, D.


(a) Facility, F

This simply measures how many candidates
answered the item correctly, as a fraction or a
percentage of the total number of responses.


To obtain the value, it is useful to arrange the 
answer cards in order of merit from the top down- 
wards. Start with Item 1. Suppose the correct 
answer was ‘B’. Count through the cards, noting 
how many candidates correctly gave ‘B’, and also 
note (if any) how many candidates left the response 
blank (omitted response). Repeat with Items 2, 3,
4 and so on in turn. Each time a ‘blank’ response is seen,
check that the answers following are not all blank.
It is necessary to distinguish between a casually 
omitted item in the main body of items, and a 
string of omitted items through to the end of a test. 
The weaker candidates may have been unable to 
complete the test and having, say, attempted about 
36 items (with occasional omitted items), have not 
had time to attempt items 37 to 50. This unin- 
terrupted string of omissions to the end of the test is 
recorded as ‘unattempted’, and each time such a 
card is found, it is removed from the pack as soon 
as the first of the unattempted string is seen—i.e. in 
the above case, at the time item 37 is being counted. 

Finally, for each of the 50 items, the following
figures will be recorded.


1. No. of correct answers = R
2. No. of omitted answers = O
3. No. of unattempted questions = U


Suppose the total number of candidates for the test
is N. Then the facility,

$$F = \frac{R}{N - U}$$

for each item in turn.
Example: For item 1, all 60 candidates gave an
answer and 48 were correct:

$$N = 60, \quad R = 48, \quad U = 0, \quad F = \frac{48}{60} = 0.80$$

or if preferred, multiply by 100 and say F = 80%.
For item 44, only 42 gave some answer, 4 omitted
the item and the remaining 14 had by this time gone
as far as they could. The 4 who omitted item 44 are
NOT counted as unattempted because each had
answered at least one later question. The 14 unattempted
not only omitted item 44, but omitted
all subsequent items too. Indeed some of the 14 may
already have left the ‘pack’ earlier, having ceased to
attempt items earlier than No. 44. If 32 of the 42
answers were correct:

$$F_{44} = \frac{32}{60 - 14} = \frac{32}{46} = 0.70 \text{ or } 70\%$$
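
For a teacher who keeps the responses in a list rather than on cards, the counting procedure above might be sketched as follows. The data layout (one list of answers per candidate, with None for a blank) and the function names are assumptions made for illustration; the rule for distinguishing ‘omitted’ from ‘unattempted’ follows the text.

```python
def tally_item(responses, key, item):
    """Return (R, O, U) for one item (item is a 0-based index).

    responses: one list of answers per candidate, None meaning blank.
    A blank counts as unattempted (U) only when every later answer
    is blank as well; otherwise it counts as omitted (O).
    """
    r = o = u = 0
    for answers in responses:
        if answers[item] is None:
            if all(a is None for a in answers[item:]):
                u += 1
            else:
                o += 1
        elif answers[item] == key[item]:
            r += 1
    return r, o, u


def facility(correct, candidates, unattempted=0):
    """F = R / (N - U), as a fraction."""
    attempted = candidates - unattempted
    if attempted <= 0:
        raise ValueError("no candidate attempted this item")
    return correct / attempted


# The two worked examples from the text.
print(facility(48, 60))                # item 1
print(round(facility(32, 60, 14), 2))  # item 44

# A small invented class of four candidates on a four-item test.
key = ["B", "A", "C", "D"]
responses = [
    ["B", "A", "C", "D"],
    ["B", None, "C", "A"],    # omitted item 2 but went on
    ["A", "A", None, None],   # stopped after item 2
    ["B", "C", None, None],   # stopped after item 2
]
print(tally_item(responses, key, 2))   # third item
```

Note that omitted answers remain in the denominator, exactly as in the card procedure: only the unattempted string is removed from the count.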


(b) Discrimination, D 

This measures how well the item separates the better 
candidates from the weaker ones. There are many 
such measures available, the one described here 
being the simplest. To obtain this, divide the pack 
of answer cards into three equal groups, after 
arranging them in order of merit. If they do not 
divide exactly by three, either put the extra one in 
the middle group, or if there are two extra, include 
one more in each of the top and bottom groups. 


36 candidates. 12 top, 12 middle, 12 bottom 
37 candidates. 12 top, 13 middle, 12 bottom 
38 candidates. 13 top, 12 middle, 13 bottom 


(i) Discard the middle group of answer cards. 

(ii) Count the number of correct responses for each 
item in each of the top and bottom groups, for 
those who attempted the item. 

(iii) Calculate in effect the facilities for the top and
bottom groups respectively, as proportions:
call them $F_T$ and $F_B$.

(iv) Subtract: $F_T - F_B = D$, the discrimination.


A practical procedure for both measures is to
divide the pack as above initially and record the
NUMBER of correct responses in each group and,
as they appear, the number of ‘unattempted’
items for each group (see table below).


Suppose there are 60 candidates: 20 in each group.
For item 10 say: 18 are correct in the top group,
0 ‘unattempted’.

$$F_T = \frac{18}{20} = 0.90$$

In the middle group 16 are correct, no ‘unattempted’.
In the bottom group of 20, only 9 are correct and 5
are ‘unattempted’ (9+5=14: the remaining 6 are
wrong). Among the 6 could be some who have
omitted this item, but they would have answered
later items and thus do not count as ‘unattempted’.
Hence:

$$F_B = \frac{9}{20 - 5} = \frac{9}{15} = 0.60$$

For the total

$$R = 18 + 16 + 9 = 43, \qquad U = 0 + 0 + 5 = 5, \qquad N - U = 60 - 5 = 55$$

Hence

$$F = \frac{43}{55} = 0.78 \text{ for the item}$$

Meanwhile $F_T - F_B = 0.90 - 0.60 = 0.30 = D$, the
discrimination for the item.

Note that in the bottom group, since there are 5
‘unattempted’, 5 candidates among those 20 have
already stopped answering items. Hence they fail
to answer items 11, 12 and so on up to 50. Later others may
stop answering items. Thus this figure of 5 in the
‘U’ column can only increase. If it increases too
rapidly and particularly if the numbers in the top
and middle ‘U’ columns grow, there is a strong
suggestion that insufficient time is being allowed for
the test, which is thus acting as a speed test.


Item   Top                   Middle   Bottom               Total                 D
No.    R    U   F_T          R    U   R    U   F_B         R    U   F            = F_T - F_B
10     18   0   18/20=0.90   16   0   9    5   9/15=0.60   43   5   43/55=0.78   0.30
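
The grouping rule and the subtraction can be sketched as below. The function names are illustrative assumptions; the group sizes and the worked figures for item 10 are those given in the text.

```python
def split_groups(n_candidates):
    """Sizes of the (top, middle, bottom) groups for a pack of cards.

    One extra card goes to the middle group; two extra cards go one
    each to the top and bottom groups.
    """
    base, extra = divmod(n_candidates, 3)
    if extra == 0:
        return base, base, base
    if extra == 1:
        return base, base + 1, base
    return base + 1, base, base + 1


def group_facility(correct, group_size, unattempted=0):
    """Facility within one group: correct / (group size - unattempted)."""
    return correct / (group_size - unattempted)


def discrimination(f_top, f_bottom):
    """D = F_T - F_B."""
    return f_top - f_bottom


for n in (36, 37, 38):
    print(n, split_groups(n))

# Item 10 from the text: 60 candidates, 20 in each group.
f_top = group_facility(18, 20)          # 0 unattempted
f_bottom = group_facility(9, 20, 5)     # 5 unattempted
print(round(discrimination(f_top, f_bottom), 2))
```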




Part 5. Compiling a Test for Class-room Use 


A teacher has a great advantage over a public 
examinations board. He knows personally the 
qualities of his pupils, and of course is dealing with 
a relatively small number of them. Further, he can 
test his class at any point in time that he wishes, and 
can thus control the content of his test to suit his 
own teaching programme.

There are two situations which arise in tests within
schools. The first is the specific short test, whose aim
is to measure progress perhaps in two or three weeks'
work. The second is the longer test, designed to
assess the knowledge and progress of the class during
a term or for the whole academic year.

For the short test, 20 items may be adequate, and
they should also be fairly easy, since the main
purpose of the test is to ensure that the whole class
has understood the work covered. Thus an average
mark of 16 out of 20, or even higher, would be
expected. The discrimination of such items will not
be high, but this is unimportant. Perhaps four or
five items in the 20 should be more difficult, covering
points of interpretation or inference, which only the
best pupils are likely to grasp without further
revision. After the test, which can be rapidly marked,
it is useful to count how many pupils have answered
each item correctly, the ‘facility’ of each item,
expressed as a percentage of the total number of
responses, since there are likely to be no unattempted
items. Particularly for the harder questions, where
the facility may well drop to 50% or so, it is also worth
checking that the majority of the better pupils are
correct, the majority of the weaker ones wrong.

By calculating the discrimination D for each item,
this measure is given by a standard method.

A longer test, of perhaps 50 items, needs more careful
planning. Clearly it is not essential to adhere
rigidly to the procedure described in Part 2, for a
full test schedule is not needed. Yet there has to be
an assessment of the content of the term's, or
perhaps year's, work so that items can be correctly
assigned to each section. There will probably be a
written paper too, and if the same teacher is required
to set both papers, it is easier to co-ordinate
them.

After such a test, it is even more desirable to 
evaluate the facility and discrimination of each item, 
particularly as the same test may cover several 
streams at the same level in the school. This shows, 
by comparing average marks, how far one stream is 
ahead of another, and also allows for a different 
pass-mark to be fairly selected for each stream. 

If the marks scored in each subject used in a class 
examination are, as so often happens, added 
together to give an overall class position, then in 
justice to all pupils, the marks should be reduced to 
a common standard before adding them. If, how- 
ever, the school report merely records the position of 
a pupil in his class, subject by subject, then there is, 
of course, no need to adjust any marks. The tech- 
nique of standardising marks is not difficult, and is 
essential if any attempt is made to add scores. 
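
One common way of reducing marks to a common standard is sketched below. It is offered only as an illustration: the passage above does not spell out the technique, and the target mean of 50 and standard deviation of 15 are assumptions chosen for the example. Each subject's marks are rescaled so that every subject has the same mean and the same spread before the marks are added.

```python
from statistics import mean, stdev


def standardise(marks, target_mean=50.0, target_sd=15.0):
    """Rescale one subject's marks to a common mean and spread.

    Uses the sample standard deviation (N - 1 in the denominator).
    The target values are illustrative assumptions.
    """
    m = mean(marks)
    s = stdev(marks)
    if s == 0:
        raise ValueError("all marks are equal; nothing to standardise")
    return [target_mean + target_sd * (x - m) / s for x in marks]


# Invented marks for the same five pupils in two subjects.
history = [30, 35, 40, 45, 50]   # narrow spread
physics = [10, 30, 50, 70, 90]   # wide spread

totals = [h + p for h, p in zip(standardise(history), standardise(physics))]
print([round(t, 1) for t in totals])
```

Without such a step the subject with the wider spread of raw marks would dominate the overall class position.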

This may be an appropriate point to discuss the 
effects of ‘coaching’ on the results of objective tests. 
It is a common criticism that one of the defects of 
such tests is that the scores of a group of pupils can 
be materially improved if they receive special 
coaching on similar tests. To some extent this is true, 
just as extra training in any field of learning improves 
the chance of success. The essential differences 
however are that, (i) the law of diminishing returns 
rapidly operates, and (ii) the large variety of items 
possible prevents actual ‘question-spotting’. The 
results of coaching have been the subject of intense 
research. 

A clear distinction should be made however 
between the use of objective tests in examinations 
perhaps once a year, with one or two trial runs just 
before the examinations to ensure familiarity with 
the technique of answering and to off-set the normal 
tension of examinations, with their misuse by 
frequent coaching tests throughout the year as a 
substitute for proper teaching. Properly used at
limited intervals for genuine educational measures
of progress and attainment, they are usually enjoyed 
by a class or group and stimulate interest. There is 
also the advantage that candidates know that they 
are not going to face the tedium of long essay 
writing. 

Like any other class-room activity however, 
variety is needed to maintain motivation and 
interest. To use tests every day as a class routine is
not only educationally unsound; the tests also become
merely a dull habit which a class passively accepts,
as indeed they already accept other routines, like
the long drone from Mr. X day in, day out, and the
‘usual’ History essay once a week for homework.

There is also some distinction between the so- 
called ‘intelligence test’ and attainment or selection 
tests in school subject groups. Both lead to harmful 
effects if over-used as an alternative to teaching. 
School subject tests in Arithmetic, Geography, 
History, English and so on have some teaching merit 
as an adjunct to their measuring ability, but the 
‘intelligence test’ (or ‘reasoning’ or ‘verbal reason- 
ing’ tests as they now tend to be called) are designed 
as a measuring device alone. Not only are candidates 
not expected to learn from them, but steps are taken 
to eliminate any changes in score due to any 
learning process by repetition of the tests. Hence 
‘coaching’ on these kinds of test is very harmful.

The main purpose of this book however is to 
discuss objective tests of educational attainment and 
to compare them with written tests in similar school 
subjects and topics. The findings largely agree with 
each other. Starting with a totally unsophisticated 
group—that is, a group who have neither attempted, 
nor even seen, a multiple-choice test before, an 
improvement of about 5% in the scores is usual on a 
second equivalent test. A third trial, a short time
later, may increase the scores by about 1% or 2%
more. Thereafter the scores remain virtually
constant. There is also some evidence that after 
regular practice attempts over several weeks, 
boredom and general lack of interest, for the novelty 
of the tests has worn off, lead to a reduction in the 
scores, the weaker candidates losing most of their 
earlier gains. 

The conclusion seems clear enough. Every pupil 
who is eventually to face a set of multiple-choice 
tests should be given one dress-rehearsal, accompanied
by generous help and encouragement and
full explanations of the answering technique. The
scores should be recorded, but no conclusions
drawn, nor should the pupils be encouraged to use 
their scores as evidence of merit order. When the 
initial interest has faded with time, a second test 
should be given. The new scores should be recorded 
and compared. For this purpose it is desirable that 
standardised tests should be used of equivalent 
value. Copies are always supplied confidentially to
schools from the national bodies in the United
Kingdom at a modest cost. Instructions for both
administration and scoring are available, and for 
deliberate familiarisation purposes, it is better to 
use tests well within the ability of the pupils. The 
correlation between the two sets of scores should be 
evaluated—it is certain to be quite high—but 
particularly large variations, either way, for in- 
dividual pupils should be noted. A third—and 
definitely final—test can be given if the scores on the 
second test increase by much more than 5% over the 
initial scores. A third test too, helps in examining 
any individuals whose results have been unusual in 
the earlier trials. 

It is good practice to ensure that each potential 
candidate for selection is provided, through the 
school, with a printed practice-test sheet. At higher 
levels, because pupils have already experienced this 
earlier examination, they are already, at least in part, 
‘sophisticated’ in the test sense. Even so, a practice 
test always precedes the final test. Experience has
shown that it is rare indeed, even with tens of
thousands of candidates who take objective tests, to 
find one who has failed to understand the simple 
instructions. The usual situation is that the can- 
didate leaves his separate answer sheet untouched 
and marks his selected options on the examination 
paper itself. When this is sorted out and marked by 
hand, in every case the mark is very low indeed, and 
in the accompanying written paper too the result 
has been a bad failure. Each time such a situation 
arises, perhaps half a dozen cases in as many years 
from some 40,000 candidates per year, it has clearly 
been the work of a very weak candidate who has 
failed on all papers. Thus a single trial for all 
candidates levels out the major change due to
coaching, and a second trial, even if only some 
candidates received it, would hardly affect the 
overall results. What, however, must strongly be 
condemned, both on educational and psychological 
grounds, is the misuse of past objective papers either 
in a misguided attempt to ‘coach’ pupils by hoping
that they will learn items by memory, or as a
substitute for positive teaching. Indeed not only 
does such attempted coaching not succeed in its 
object, but it destroys one of the main merits of 
objective tests. This is that objective tests usually last 
for 75 minutes at most, and candidates look forward 
to them, whereas a two to three hour paper frequently
tends to depress them with the thought of the
physical effort of writing alone. Most normal
candidates are tense and rather nervous in the 
examination rooms. Exaggerated tension is harmful, 
but a slight degree of tension is probably beneficial. 
It provides a stimulus to the candidate to do his best. 
A completely relaxed candidate is rare and is usually 
the result of massive over-confidence; the resulting 
output often reflects a carelessness and lack of
attention to neatness which is quite typical. 

In a multiple-choice test the candidates know that
the correct answer lies before them and are usually
quite sure that they cannot fail to select it in each
item. Their keenness and general motivation are at
their highest; they know that there is not much
reading to do, save possibly in comprehension
passages associated with language studies, and no
real writing at all. In a simple class-room test there
is not much opportunity to select items which do
more than measure progress in relation to the topics
being taught at the time. As stated above, the
average mark should be high, the items fairly easy.
The basic problem is to find out whether all the class
have understood and learned the work taught in a
limited time.

For term or annual tests, both longer and harder
tests are appropriate. An average mark of about 50%
should be aimed at, by choosing items whose
facilities range from about 20% to 80%, but
average around 50%. It is important too to look for
items with good discrimination. Whereas in a class-room
test, because most pupils should answer most
items correctly, discrimination is bound to be low, as
low as 0.20, it should not be allowed to fall below
0.40 for an annual test on a year's work and if it can
be kept to a minimum value of 0.50 for all items, so
much the better.
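
The selection rule for a term or annual test might be sketched as follows. The record layout, the function names and the six trial items are all invented for illustration; the limits (facility between 20% and 80%, discrimination not below 0.40) are those given in the text.

```python
from dataclasses import dataclass


@dataclass
class Item:
    number: int
    facility: float        # fraction, e.g. 0.55 for 55%
    discrimination: float  # D = F_T - F_B


def select_items(items, min_f=0.20, max_f=0.80, min_d=0.40):
    """Keep the items whose facility and discrimination are acceptable."""
    return [i for i in items
            if min_f <= i.facility <= max_f and i.discrimination >= min_d]


def mean_facility(items):
    """Average facility of a set of items, as a fraction."""
    return sum(i.facility for i in items) / len(items)


# Invented trial results for six items.
stock = [
    Item(1, 0.90, 0.15),   # too easy, poor discrimination
    Item(2, 0.55, 0.45),
    Item(3, 0.30, 0.50),
    Item(4, 0.50, 0.25),   # discrimination too low
    Item(5, 0.15, 0.60),   # too hard
    Item(6, 0.65, 0.40),
]
chosen = select_items(stock)
print([i.number for i in chosen], round(mean_facility(chosen), 2))
```

The average facility of the chosen items gives a rough forecast of the average mark, which is why the text asks for facilities that average around 50%.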

This however is a counsel of perfection. In a 
school it is not usually practicable to try out say 
200 items on each subject and analyse them and 
select the best 50 for use on exactly the same pupils 
shortly afterwards, nor is it easy to guarantee that a 
well-chosen set of items can be kept secret year after 
year in the hurly-burly of most schools' daily life. 

All that can be reasonably expected is that, as a 
result of class-room tests, a teacher has accumulated 
not only a stock of items to which he knows the 
likely Facility and Discrimination, but more 
important a stock of ideas to help him compile a 
longer test for use at the end of the year. 

After the examination is over and the papers 
marked, an item-analysis of the results is valuable. 
It shows which items proved to be effective and 
there is an advantage in storing good items in a 
private file, for re-use after an interval of a year or 
two, mixed with new items. 

The simple analysis outlined above does not 
allow items to be divided into diagnostic or predic- 
tive items separately. The more detailed analysis 
given later does allow this separation, but this also 
presupposes a testing programme covering many 
thousands of candidates and is outside the scope of 
class-room tests. 




Part 6. Statistics Needed for the Analysis of Objective Tests 


This section will be sufficiently detailed for those
with mathematical interest but no statistical
knowledge, but no attempt is made either to present
proofs or to follow a rigid mathematical development.
Each stage of a calculation is shown and a
simplified example is given for each type of calculation
to illustrate the steps needed. It is very useful if
a simple calculating machine is available, and helpful
for some calculations if a slide rule can be employed.
For class-room use a very simple procedure is given
earlier in Part 4 as a part only of the more detailed
analysis needed for larger examinations.


1. Notation 


To avoid repetition, all of the notation used is listed
here. The letters X, Y are used to denote test scores
on two different tests, or perhaps on two parts of the
same test. Each candidate has an actual score on each
test and these will be particular values of X and Y for
that candidate. The letters are generalised for any
individual score. The Greek capital $\sum$ (sigma) is
used, as standard practice, to denote summation.
Thus $\sum X$ means ‘the total of all the marks scored
by candidates on this test’.


$N$ = number of candidates taking the test
$n$ = number of items in the test
$s_x$ = the standard deviation of the marks, X, on
the test
$r_x$ = the reliability of scores on the test, whose
marks are X
$e_x$ = the standard error of scores on the test, whose
marks are X

($s_y$, $r_y$, $e_y$ are the same quantities relating to a
second test, whose scores are Y.)

$F$ = facility of an item; i.e. $F_{20}$ = facility of item 20
$R$ = double tetrachoric correlation coefficient for
an item of the test, i.e. $R_{20}$ = coefficient for
item 20
$s_{xy}$ = the covariance between the marks, X, on one
test and the marks, Y, on a second test
$r_{xy}$ = product-moment correlation coefficient between
the scores on the two tests, whose
marks are X and Y.


2. The average, or mean mark on a test


This simple statistic has been discussed above in 
Part 4. 


3. The standard deviation of a mark distribution


This is a measure of the spread of marks above and
below the average mark. In a large population many
of the measurements associated with natural
features or abilities are distributed in a form which
is called a ‘normal distribution’. Examples are the
heights of any large group of people and, similarly,
the academic ability of children, allowing for the
effect of different ages. Graphically, such a distribution
produces a ‘bell-shaped’ curve:


[Fig. i: bell-shaped curve centred on the average mark]


This is symmetrically placed around the average
mark, and at two points, A, one on either side of the
average, there are points of inflexion which are
mathematically convenient points to use as a
standard by which the width of the curve at any other
point can be measured. The distance from either
point A to the central line is called ‘the standard
deviation’ for the curve. It is a deviation of marks, in
this case from the average. For example, at a
distance approximately $2s_x$ above or below the mean
about 2½% of the total area is cut off in each of the
tails:


(See Table 3 in the Appendix: 2½% gives c = 1.960,
approx. 2.)


It can thus be said that a total of 5% of the area lies
outside a distance of twice the standard deviation
from the mean. This is a common reference level.
Applied to a mark distribution, for example, 2½%
of the candidates are likely to score above this mark,
and another 2½% below a corresponding low mark.
The standard curve has been tabulated to a high
degree of accuracy and, using such tables, any
percentage of ‘area’ can be read off in relation to the
distance above or below the average in terms of
units of standard deviation. Useful points are:
$1s_x$ above or below the average includes about 68%
of ‘cases’ in the band, $2s_x$, as stated above, includes
95% of cases in the band, and $3s_x$ includes 99% of
cases in the band, i.e.


½% lie over three standard deviations above the
average mark


2½% lie over two standard deviations above the
average mark


As an example, suppose 1000 candidates take a
100-item test, designed to give an average mark of 50.
If the standard deviation is, say, 15 marks, then 680
candidates (68%) will score between (50−15) and
(50+15) marks, i.e. 35 to 65 marks.
Taking twice the standard deviation, i.e. 30
marks, then 950 (95%) of scores will lie between
(50−30) and (50+30) marks, i.e. 20 to 80 marks.

Finally, taking three times the standard deviation,
i.e. 45 marks, then 990 (99%) of scores will lie
between (50−45) and (50+45) marks, i.e. 5 to 95
marks.

We can thus draw up bands of marks, i.e.


Marks          Candidates   Band
96 and over         5       ½% above 3s_x
81 to 95           20       within 3s_x (99%)
66 to 80          135       within 2s_x (95%)
50 to 65          340       within 1s_x (68%)
35 to 49          340       within 1s_x (68%)
20 to 34          135       within 2s_x (95%)
5 to 19            20       within 3s_x (99%)
4 and under         5       ½% below 3s_x

Total            1000 candidates
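
For readers with access to a computer, the bands can be checked against the exact normal curve. The sketch below assumes Python's standard `statistics.NormalDist` and treats marks as continuous; the function name is an assumption. It shows that the figures of 68%, 95% and 99% used above are convenient round numbers: the exact areas within one, two and three standard deviations are close to 68.3%, 95.4% and 99.7%.

```python
from statistics import NormalDist


def expected_in_band(low, high, mean=50.0, sd=15.0, candidates=1000):
    """Expected number of candidates scoring between low and high,
    assuming an exactly normal distribution of marks."""
    dist = NormalDist(mean, sd)
    return candidates * (dist.cdf(high) - dist.cdf(low))


# Within one, two and three standard deviations of a mean of 50.
for low, high in ((35, 65), (20, 80), (5, 95)):
    print(low, high, round(expected_in_band(low, high), 1))
```

The same function can be used to forecast the number of candidates in any band of marks once the mean and standard deviation of a trial test are known.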

This illustrates two points: first that a theoretical 
mark distribution can be predicted from the trial 
testing of objective tests and can then be compared 
with the actual results; second, how even a good 
spread of marks still leads to heavy bunching of 
marks near the middle of the range. The standard 
deviation of 15 is that used by a large number of 
properly validated and standardised tests. Some- 
times a mean of 100 is used, with standard deviation 
of 15 as for Intelligence Quotients. The calculation 
of a standard deviation is not difficult, but is 
materially helped by a calculating machine. 

The formula for the VARIANCE of a set of scores is
given by

$$s_x^2 = \frac{N\sum X^2 - \left(\sum X\right)^2}{N(N-1)} \qquad (1)$$

It can be seen that the VARIANCE $s_x^2$ is the square of
the STANDARD DEVIATION, $s_x$.

The denominator of the above expression
$N(N-1)$ is important if N is fairly small, say 30
or under. If N is larger than this, then the value of
$N(N-1)$ is not very different from $N^2$ and the above
formula can be reduced to:

$$s_x^2 = \frac{\sum X^2}{N} - \bar{X}^2 \qquad (2)$$

The advantage of formula (1) is that, on a calculating
machine, the quantities $\sum X^2$ and $\sum X$ are
readily available. The former, $\sum X^2$, is the sum of the
squares of individual marks and the latter, $\sum X$,
is the sum of the individual marks; of course $\bar{X}$, the
mean mark, is $\dfrac{\sum X}{N}$.

To illustrate the steps involved a very simple example 
is used. Suppose there are 10 marks of 


8, 10, 12, 14, 18, 20, 21, 23, 25, 29 


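Then

$$\sum X = 8+10+12+14+18+20+21+23+25+29 = 180$$

$$\bar{X} = \frac{\sum X}{N} = \frac{180}{10} = 18 \text{ marks}$$

$$\sum X^2 = 8^2+10^2+12^2+14^2+18^2+20^2+21^2+23^2+25^2+29^2 = 3664$$

Using formula (1):

$$S_x^2 = \frac{10 \times 3664 - 180^2}{10 \times 9} = \frac{4240}{10 \times 9} = 47.1111$$

$$S_x = \sqrt{47.1111} = 6.864$$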


If, however, formula (2) were used here:

$$S_x^2 = \frac{3664}{10} - 18^2 = 366.4 - 324 = 42.4$$

and $S_x = 6.5115$, which can be seen to be rather too
small. This illustrates the need to use formula (1)
if $N$ is small.

There is yet another version of formula (1) which
may be useful if the average mark is exactly equal to
a whole number, or may be so taken without too much
loss of accuracy. Then

$$S_x^2 = \frac{\sum (X - \bar{X})^2}{N-1} \qquad (1a)$$

Applying this formula to the above example, we
see that $\bar{X} = 18$. Here we subtract 18 from each score
before squaring, i.e. 8, 10, 12, etc. become

-10, -8, -6, -4, 0, 2, 3, 5, 7, 11     (i.e. $X - \bar{X}$)

Now square these:

100, 64, 36, 16, 0, 4, 9, 25, 49, 121     (i.e. $(X - \bar{X})^2$)

Add them: $\sum (X - \bar{X})^2 = 424$

$$S_x^2 = \frac{424}{9} = 47.1111$$

as before; $S_x = 6.864$ as before.

Some time has been devoted to illustrating the use 
of these formulae since the standard deviation is a 
very important statistical measure for any mark 
distribution. 
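
The three formulae can be checked against one another with a short Python sketch. The function names are illustrative only and do not come from the text; the marks are the ten used in the worked example.

```python
import math

def variance_formula_1(marks):
    """Formula (1): (N * sum(X^2) - (sum X)^2) / (N * (N - 1))."""
    n = len(marks)
    sum_x = sum(marks)
    sum_x2 = sum(x * x for x in marks)
    return (n * sum_x2 - sum_x ** 2) / (n * (n - 1))

def variance_formula_2(marks):
    """Formula (2): sum(X^2) / N - mean^2, intended for large N only."""
    n = len(marks)
    mean = sum(marks) / n
    return sum(x * x for x in marks) / n - mean ** 2

def variance_formula_1a(marks):
    """Formula (1a): sum((X - mean)^2) / (N - 1)."""
    n = len(marks)
    mean = sum(marks) / n
    return sum((x - mean) ** 2 for x in marks) / (n - 1)

marks = [8, 10, 12, 14, 18, 20, 21, 23, 25, 29]

for formula in (variance_formula_1, variance_formula_2, variance_formula_1a):
    variance = formula(marks)
    # Print the variance and the standard deviation for each formula
    print(formula.__name__, round(variance, 4), round(math.sqrt(variance), 4))
```

Formulae (1) and (1a) agree exactly, while formula (2) gives the smaller value because it divides by N rather than N - 1.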


4. Skewness, quartiles and percentile score


In many examinations the marks of the candidates 
are not symmetrically balanced around the average 
mark, although overall the mark distribution is 
approximately normal in its general shape. This lack 
of symmetry is termed skewness and is characterised 
by a long tail of either very high, or very low marks, 


e.g.:

[Fig. iii: two skewed mark distributions. POSITIVE SKEW: $\bar{X}$ above the Mode. NEGATIVE SKEW: $\bar{X}$ below the Mode. (The median is always between the Mode and the Mean.)]


The peak of the curve occurs at a score called the 
Mode (M) of the distribution. Alone, the mode 
score tells us very little. In each case above, the
average score, $\bar{X}$, is no longer at the mode, as it is
with a symmetrical distribution.

A third measure of statistical value is the Median
($M_e$). This is the score which divides the order of
merit into two equal halves, so that half of the 
candidates have scores below the median. The 
median and mode scores have little relevance to the 
results of only a few candidates where in any case it 
is easy to calculate quickly the average mark. 

With several hundred candidates however, the 
order of merit inevitably contains many candidates 
with equal scores. It is then easy to pick out the 
median score for these candidates and by a simple 
extension of the idea of a median score decide 
whether the distribution of marks is skewed or 
symmetrical and if the former, make an estimate of 
the degree of skewness present. 

When a list of candidates’ marks is arranged in 
this way in an order of merit, we can calculate what 
are known as ‘percentile scores’. To do this we 
‘accumulate’ the number of candidates scoring each 
mark from the bottom of the order of merit. 


AN EXAMPLE

Score   No. of candidates   Cumulative No.   Percentile
        with this score     from bottom
 49             9               1000           100.0
 48             8                991            99.1
 47            11                983            98.3
 46            20                972            97.2
 45            24                952            95.2
 44            36                928            92.8
 43            42                892            89.2
 42            48                850            85.0
 41            52                802            80.2
 40            56                750            75.0    <- Upper quartile, $Q_3$
 39            84                694            69.4
 38           110                610            61.0
 37            96                500            50.0    <- Median score, $M_e$
 36            80                404            40.4
 35            74                324            32.4
 34            60                250            25.0    <- Lower quartile, $Q_1$
 33            51                190            19.0
 32            47                139            13.9
 31            36                 92             9.2
 30            28                 56             5.6
 29            14                 28             2.8
 28             9                 14             1.4
 27             5                  5             0.5

Total        1000


The method is to ‘accumulate’ candidates from
the lowest score upwards, e.g.:

5 + 9 = 14
14 + 14 = 28
28 + 28 = 56
56 + 36 = 92
92 + 47 = 139 etc.


These cumulative totals are divided by N, the 
number of candidates and multiplied by 100, to 
obtain the percentile scores. 

For simplicity of explanation, the number of 
candidates is here 1000, hence the percentiles are 
one-tenth of the cumulative numbers at each score 
level. Note that the 50th percentile is at 37 marks. 
Hence 37 is the Median score. Similarly 34 is the 
25th percentile or lower quartile score and 40 is the 
75th percentile or upper quartile score. 
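
The accumulation can be sketched in Python as follows. The names `percentile_table` and `score_at_percentile` are hypothetical helpers, not part of the text. The frequencies are those of the example, taking 8 candidates at a score of 48, which is the figure implied by the cumulative column.

```python
# Number of candidates at each score, taken from the example table.
counts = {
    27: 5, 28: 9, 29: 14, 30: 28, 31: 36, 32: 47, 33: 51, 34: 60,
    35: 74, 36: 80, 37: 96, 38: 110, 39: 84, 40: 56, 41: 52, 42: 48,
    43: 42, 44: 36, 45: 24, 46: 20, 47: 11, 48: 8, 49: 9,
}

def percentile_table(counts):
    """Map each score to (cumulative number from the bottom, percentile)."""
    total = sum(counts.values())
    running = 0
    table = {}
    for score in sorted(counts):  # lowest score first
        running += counts[score]
        table[score] = (running, 100 * running / total)
    return table

def score_at_percentile(table, p):
    """Return the lowest score whose percentile is at least p."""
    for score in sorted(table):
        if table[score][1] >= p:
            return score
    return None

table = percentile_table(counts)
print("Lower quartile:", score_at_percentile(table, 25))
print("Median:", score_at_percentile(table, 50))
print("Upper quartile:", score_at_percentile(table, 75))
```

With other data the required percentile will rarely fall exactly on a score; taking the lowest score that reaches it is one simple convention, adopted here as an assumption.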

Because in this example the gap:

$Q_3 - M_e = 40 - 37 = 3$ marks,

and the gap:

$M_e - Q_1 = 37 - 34 = 3$ marks,

the distribution is reasonably symmetrical.
The total gap $Q_3 - Q_1 = 6$ marks.
Half of this, i.e. $\frac{Q_3 - Q_1}{2} = 3$ marks, is denoted
by $Q$ and called the ‘semi-interquartile range’.
A simple formula used for measuring skewness is

$$\text{Skewness} = \frac{Q_1 + Q_3 - 2M_e}{2Q}$$

Here this gives:

$$\text{Skewness} = \frac{34 + 40 - 2 \times 37}{2 \times 3} = 0$$

because the median is symmetrically placed between
the upper and lower quartiles.

In this example, the Mode is at 38 marks since at 
this score, there is the greatest number of candidates. 
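
A minimal Python sketch of the quartile measures follows. The function names are illustrative, and the second set of quartiles printed is a hypothetical skewed case, not data from the text.

```python
def semi_interquartile_range(q1, q3):
    """Q = (Q3 - Q1) / 2."""
    return (q3 - q1) / 2

def quartile_skewness(q1, median, q3):
    """Skewness = (Q1 + Q3 - 2 * median) / (2 * Q)."""
    q = semi_interquartile_range(q1, q3)
    return (q1 + q3 - 2 * median) / (2 * q)

def mode(counts):
    """Score with the greatest number of candidates."""
    return max(counts, key=counts.get)

# Quartiles of the worked example: symmetrical, so skewness is zero.
print(quartile_skewness(34, 37, 40))

# Hypothetical quartiles with the median nearer the lower quartile,
# giving a positive value.
print(quartile_skewness(30, 32, 40))

# Three neighbouring frequencies from the example table.
print(mode({37: 96, 38: 110, 39: 84}))
```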

The cumulative percentage table, or table of 
percentiles can be plotted as a graph against the 
scores and this produces a curve known as an ogive
or sometimes as a ‘sigmoid curve’. Fig. (iv) shows a
perfect ogive corresponding to a perfect normal
curve. For convenience of use, the vertical scale is 
inverted with the 0% at the top and the 100% at the 
bottom. This allows marks to be accumulated from 
the top, rather than the bottom of the mark distri- 
bution. It can be read for any mark as follows: 


[Fig. iv: ogive for a perfectly normal distribution. Standard deviation = 15, Mean = 50%. Horizontal scale: percentage marks (actual marks), 0 to 100. Vertical scale: inverted, from 0% at the top to 100% at the bottom, with the upper quartile, median and lower quartile marked.]


Consider 70 marks out of 100, i.e. 70% of the
total marks (horizontal scale). The curve at 70%
of the marks cuts the vertical scale at the level of
9.5%. Hence 9.5% of the candidates for a perfectly
normal distribution should score 70% of the marks
or better. This particular curve is calibrated
for an average mark of 50%, whence by symmetry,
50% of the candidates score this mark or
better.
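
A reading from such a curve can be approximated by calculation. The sketch below assumes a perfectly normal distribution with mean 50 and standard deviation 15, as in Fig. (iv); `percent_at_or_above` is a hypothetical name. It relies on the standard identity that the upper tail of the normal curve is half the complementary error function of z divided by the square root of 2.

```python
import math

def percent_at_or_above(mark, mean=50.0, sd=15.0):
    """Percentage of a normal distribution scoring `mark` or better."""
    z = (mark - mean) / sd
    return 100 * 0.5 * math.erfc(z / math.sqrt(2))

for mark in (30, 50, 70):
    print(mark, round(percent_at_or_above(mark), 1))
```

For 70 marks the calculated figure is a little over 9%, which is close to the value read from the graph.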


Such a curve can be used for any large examination
by drawing on it similar ogives for each examiner in
any subject. The discrepancy between examiners can
thus be seen and if each has a fair sample of the
scripts, their marks can be adjusted to any agreed
standard.

Scores on objective tests can be similarly adjusted
to agreed standards with such ogives, to obtain
parity between different subjects.


5. Accuracy of working 


An examination body dealing with many thousands 
of candidates naturally lays down agreed statistical 
procedures at all stages of its work. In a school or 
class-room test or examination, it is not essential to 
lay down rigid procedures and approximate methods 
are sufficiently accurate. All statistical measures 
contain some margin of error due to sampling 
variation and the lack of absolute accuracy in marks, 
particularly those arising from the marking of essay 
type answers. Hence although calculations should 
carry several decimal places where these arise, the 
results should be recorded at most to one decimal 
place. Indeed the nearest whole number for average 
marks is usually adequate. 


6. Item analysis 


We now turn from statistical measures applicable to 
the whole range of marks for all candidates, to those 
measures which are used for the individual items 
from which the test is built up. These measures are 
contained within the phrase: item analysis. 

This is a vast topic on which several books and 
many research papers have been written. To attempt 
to reduce the discussion to a few lines may be over- 
ambitious, yet the basic principles are clear enough. 
The object of item analysis is to answer the question:
‘What does this item contribute to the overall result
of the test?’ If it can be shown that this contribution
is negligible, the item can be eliminated. The
surviving items will be a shorter and more efficient
test, although to preserve adequate coverage of a
syllabus, the eliminated items may have to be
replaced by others proved to be effective. If several
tests are tried simultaneously, item analysis enables
the most efficient items to be combined into a single,
powerful test. What are the criteria to be used in
judging an item? There are basically two: first,
it must be of reasonable facility, i.e. the number of 
candidates who succeed in answering the item 
correctly must be neither too few, nor too many; 
second, the item must assist sufficiently in dis- 
criminating between good and poor candidates. 

In a class-room test of, say 20 items, used to 
measure progress over a few weeks, the teacher will 
expect about 75% of his class to score at least 15 
correct out of the 20 items. There is then no point in 
carrying out a detailed item analysis. In a larger
test, say of 50 or 100 items given to the whole of a
school year of several streams, perhaps up to 120 
pupils in all, some item analysis is very desirable 
even if, as a result, the test is only re-scored eliminat- 
ing poor items. If the items can be kept secret, by 
collecting all of the used papers in again, those shown 
to be effective can be saved for later use in a sub- 
sequent year, enabling year-by-year comparisons to 
be effectively carried out. 

(i) The facility, F, of an item is simply the proportion
of candidates who answer correctly, measured
against the number who attempt the item. Towards
the end of a test, the weaker candidates may well
have left all of the items unattempted through
shortage of time. These are not counted in the total.

Thus facility,

$$F = \frac{\text{No. of candidates who answer correctly}}{\text{No. of candidates who attempt the item}}$$

If each item is considered, there are four possi- 
bilities: 


(i) Attempted correctly                         R
(ii) Attempted, wrong answer                    W
(iii) Omitted in the body of the test           O
(iv) Unattempted, and all following items
     necessarily unattempted                    U


The difference between an omitted item and an 
unattempted item is that the former is followed by 
attempted items, and the latter is followed by further 
blanks, to the end of the test. The timing of the test 
should be designed to reduce unattempted items to a 
minimum, although of course, there has to be a 
practical limit, and very weak candidates will still 
leave some unattempted items. 

Thus:

$$F = \frac{R}{R+W+O} = \frac{R}{T-U} \qquad (3)$$

where $T$ = number of candidates.
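
The distinction between omitted and unattempted items, and the facility that follows from it, can be sketched in Python. The function names, the use of `None` for a blank answer and the sample answer sheet are all assumptions made for illustration.

```python
def classify_responses(responses, key):
    """Code each answer as R, W, O or U.

    `responses` holds the option chosen for each item in test order,
    with None for a blank; `key` holds the correct options.
    """
    # Position of the last item the candidate actually attempted.
    last_attempted = -1
    for position, answer in enumerate(responses):
        if answer is not None:
            last_attempted = position

    codes = []
    for position, (answer, correct) in enumerate(zip(responses, key)):
        if answer is None:
            # A blank followed by a later attempt is an omission;
            # trailing blanks are unattempted.
            codes.append("O" if position < last_attempted else "U")
        elif answer == correct:
            codes.append("R")
        else:
            codes.append("W")
    return codes

def facility(codes_for_item):
    """Facility of one item, F = R / (T - U), from all candidates' codes."""
    attempted = len(codes_for_item) - codes_for_item.count("U")
    if attempted == 0:
        return None  # nobody reached the item
    return codes_for_item.count("R") / attempted

print(classify_responses(["A", None, "C", "E", None], ["A", "B", "C", "D", "E"]))
print(facility(["R"] * 4 + ["W"] * 2 + ["O"] * 2 + ["U"] * 2))
```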

Practical limits for F are from 0.30 to 0.80, i.e.
30% to 80% correct. There should, however, be more
items in the middle of the zone, and for five alternative
answers a good working average facility is
about 0.69 and up to about 0.74 for four alternative
answers (see Lord, F. M. (28)). The reason for not
using the accepted ritual of 50% facility is that by
making items rather easier, the weaker candidates
are still likely to have some idea of the items and are
less likely to guess. Hence, by lowering the incidence
of chance scores among the weaker candidates,
there is more likelihood of an overall gain in
selectivity and thus in reliability. The several harder
items with facilities down towards the 30% limit will
eliminate these weaker candidates sufficiently to
produce a useful mark spread. Clearly, if F = 1.0 or
indeed F = 0, neither type of item can possibly help
in discrimination, since in the first case all candidates
answer correctly, and all scores increase by
1; in the second case, no candidate answers correctly
and the item is therefore merely a time-waster.

The second part of the general question is:
‘Which group of candidates answer the item
correctly?’ It is at this point that the theory of item
analysis becomes involved, as there are several
statistical measures available. The simplest of all,
and perfectly adequate for class-room use, is the
evaluation of the DISCRIMINATION, D. To calculate
this, simply divide the order of merit by test scores
into three equal parts: the top third, the middle
third, and the bottom third. Count the number of
pupils in each of the top and bottom thirds who
answer an item correctly from those attempting it.
It is in effect the calculation of two separate facilities,
using formula (3) for each of the upper and lower
thirds. The difference between them is D.

Example 5: A class of 31 boys take a test. The top 
ten score 38 or more out of 50; the bottom ten score 
18 or less out of 50. (The eleven in the middle group 
are not counted at this stage.) 

For item 26, say, eight in the top ten are correct:

$$F = \frac{8}{10} = 0.80$$

Four in the bottom ten are correct, but two others
did not reach the item and for them it is unattempted.

$$F = \frac{4}{8} = 0.50 \quad \left(\text{i.e. } \frac{R}{T-U}\right)$$

$$D = 0.80 - 0.50 = 0.30$$

This is an adequate value (just) for D, which
should be as high as possible (maximum = 1) and
not less than 0.30 for an item to be sufficiently
selective of the better boys. Note that if the ‘unattempted’
2 in the bottom third are NOT allowed for,
F becomes 0.40 and a spuriously high value of D is
then obtained.
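
The arithmetic of Example 5 can be sketched in Python. The function names are illustrative, and the rule used in `third_sizes` for a class that does not divide exactly by three is an assumption chosen to match the 10, 11, 10 split of the example.

```python
def third_sizes(n):
    """Sizes of the top, middle and bottom groups for n candidates."""
    size = n // 3
    return size, n - 2 * size, size

def facility(right, total, unattempted=0):
    """F = R / (T - U) for one group."""
    return right / (total - unattempted)

def discrimination(top, bottom):
    """D = facility of the top third minus facility of the bottom third.

    Each argument is a tuple (right, total, unattempted).
    """
    return facility(*top) - facility(*bottom)

print(third_sizes(31))
# Item 26: 8 of 10 correct at the top; 4 correct at the bottom,
# where 2 of the 10 did not reach the item.
print(round(discrimination((8, 10, 0), (4, 10, 2)), 2))
# Ignoring the two unattempted papers inflates D.
print(round(discrimination((8, 10, 0), (4, 10, 0)), 2))
```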

For class-room use, item analysis can well stop at 
this point, in recording F and D for each item. For 


larger purposes however, a continuation of the 
same method, but using also the middle third, is 
desirable. The simplest way to introduce the 
technique is by a numerical example, followed by a 
short discussion on its general application. Tables 
are needed to complete the analysis and these are 
printed in Appendix (2), Table 1, for use in item 
analysis. The count needed is entirely within the 
thirds of the order of merit already separated. It is 
useful to prepare a set of blank cyclostyled sheets, 
since one sheet is needed for each item. On this sheet 
is recorded how many candidates in each third use 
each answer option, or omit, or fail to reach the item. 
The sheet, when completed, looks like this: 


Answer options

A | B | (C) | D | E | O | U | T | T-U | P1/23 | P12/3


C is ringed as the correct answer. Thus, 20, 18, 5 
respectively correctly answered this item. Four 
candidates in the bottom group omitted the item 
(O), but attempted later ones in the test. Two and 
four candidates in the middle and bottom groups 
respectively did not attempt this item (U column) 
and these six candidates would have attempted the 
preceding question and left this and all later ones 
blank. There were 104 candidates in the whole
group, of whom 94 attempted this item: (T-U)
column.

The facility,

$$F = \frac{43 \text{ (total correct)}}{94 \text{ (total attempted)}} = 0.46$$


The columns headed P1/23 and P12/3 are proportions
related to the option C and the numbers in the
(T-U) column.

In the P1/23 column, the 0.59 is the proportion
correct in the top third only, i.e. $\frac{20}{34} = 0.59$.

Similarly, the lower figure in the P12/3 column of
0.17 is the proportion correct in the bottom third,
i.e.

$$\frac{5}{30} = 0.17 \text{ (to two decimal places)}$$


The difference between these, 0.59 - 0.17 = 0.42, is
the value of D, the discrimination discussed earlier.
Clearly, if F = 0.46, D = 0.42, the item is selective
enough and of average difficulty. It is therefore
acceptable as an item.
The remaining two decimals of 0.36 and 0.56 in the
last two columns are obtained by using two of the
three thirds together. Note how they spread over
two lines in the table.
The value 0.36 is the total proportion correct in the
lower two thirds, i.e.:

$$\frac{18+5}{34+30} = \frac{23}{64} = 0.36$$

and similarly the value 0.56 is the total proportion
in the upper two thirds, i.e.:

$$\frac{20+18}{34+34} = \frac{38}{68} = 0.56$$

Using Table 1 in Appendix (2), the pair in each
column is read off as a correlation coefficient, i.e.:

0.59 and 0.36 in the table: read off as 0.34
0.56 and 0.17 in the table: read off as 0.59

The final value of the double-tetrachoric coefficient
is obtained by averaging these:

$$R = \frac{0.34 + 0.59}{2} = 0.46$$

(using two decimal places, without correction).
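
The four proportions that are taken to the table can be checked with a short Python sketch. The numbers attempting in each third (34, 34 and 30) are an assumption inferred from the quoted proportions, and the function name is hypothetical. The look-up in Table 1 of Appendix (2) is not reproduced here.

```python
def item_proportions(correct, attempted):
    """Proportions correct for the groupings used in the two columns.

    `correct` and `attempted` are (top, middle, bottom) counts.
    """
    top_c, mid_c, bot_c = correct
    top_a, mid_a, bot_a = attempted
    return {
        "top third": top_c / top_a,
        "lower two thirds": (mid_c + bot_c) / (mid_a + bot_a),
        "upper two thirds": (top_c + mid_c) / (top_a + mid_a),
        "bottom third": bot_c / bot_a,
    }

# Correct answers 20, 18, 5; attempted 34, 34, 30 (inferred, see above).
proportions = item_proportions((20, 18, 5), (34, 34, 30))
for name, value in proportions.items():
    print(name, round(value, 2))

# The simple discrimination D uses only the top and bottom thirds.
print(round(proportions["top third"] - proportions["bottom third"], 2))
```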
What has been gained from this extra work?
Firstly, a lower limit for the double-tetrachoric
coefficient is about 0.40, hence at 0.46 the item is
acceptable. Occasionally an item is acceptable by
this standard, but would be rejected by the cruder
test of the discrimination, since the middle third can
make quite a lot of difference. However, more
important is the further knowledge of the behaviour
of the item. It will be seen that in the last two columns
of the table 0.59 and 0.56 are close, yet 0.36 and 0.17
are not similar. This is reflected in the table readings
of 0.34 and 0.59 respectively. If the second value, as
here, is considerably greater than the first, it shows
that in effect the item is rejecting mainly the weakest
candidates, but the top two-thirds can answer it
correctly. This item, therefore, would be a useful one 
in any test where about 60 % of the candidates are to 
be passed. If alternatively, the recorded figure in the 
P1/23 column had been greater than that in the 
second column, the item would then be selecting the 
top third, and both of the lower thirds would find it 
difficult. Thus, such an item is of more use in a test 
designed to select the top 25% or so of candidates 
such as, for example, the Common Entrance 
Examination for Secondary Schools, or the ‘11-plus’ 
transfer examination. 

Where the two figures are approximately equal, 
the item is usually called a ‘grader’ and is uniformly 
selective throughout the range of ability. In building 
up a final paper, by judicious selection of items using 
all of this data, the mark distribution required can 
be controlled, within the normal limits of experi- 
mental error, so that the main purpose of the 
examination can be achieved and the selectivity of 
the test can be adjusted to be most effective at 
whatever the expected or required pass-rate is to be. 
It is worth, as an exercise for the reader, showing 
that a pair of items, one selective of the top third, 
the other as in the example, rejecting the bottom 
third, has the same effect as two ‘grader’ items. 

In conclusion, many of the techniques which 
depend on selecting only some percentage of the 
top and bottom candidates fail to allow for this 
possibility of bias in an item, produced by the middle 
group. The simple discrimination D, is in this 
category, but has the merit of great simplicity. 

By recording all of the answer options, it is 
possible to see whether any option is failing in its 
task of distracting. In the example, Option B is so 
little used that it could well be omitted, if possible, 
or a more plausible answer provided. It also reveals, 
occasionally, that an item is unexpectedly difficult 
because some other wrong answer is attracting too 
many candidates. Possibly such an answer is a
common mistake made by students, but usually
there is some flaw in the wording of the question
which allows ambiguity to enter. Certainly such an
item needs rewriting.


7. Reliability of a test 


This is a measure of the consistency of test scores 
between one administration of a test and the next. 


There are three basic ways of measuring reliability. 
The first is called the test-retest method and needs 
two forms of the same test which have been shown 
to be equivalent. The correlation coefficient between 
the two sets of scores for any group of candidates is 
a measure of this reliability. The second is called the
‘split-half’ method and is used when only one test
is available. This test is imagined as two equivalent
halves, a common method of dividing being to take
all odd-numbered items as one half and all even-numbered
as the other half. The correlation between
the scores is then ‘boosted’ to represent the reliability
on the total test by a particular case of an empirical
formula, called the Spearman-Brown formula:

$$r_{tt} = \frac{2r}{1+r}$$

where $r$ = original correlation between the halves
and $r_{tt}$ is the final reliability.

This method is useful in class-tests because it is
easy to arrange the test paper in two parts, or to
mark the items by hand in two groups so that each
pupil has a score for each of the two ‘halves’. The
method of obtaining the correlation coefficient
follows later.
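
The odd and even split and the boosting step can be sketched in Python. The function names and the six item marks are hypothetical; the formula is the two-halves case of the Spearman-Brown formula, 2r / (1 + r).

```python
def split_half_scores(item_marks):
    """Return (odd-item score, even-item score) for one pupil.

    `item_marks` lists the mark (1 or 0) for items 1, 2, 3, ... in order.
    """
    odd = sum(item_marks[0::2])   # items 1, 3, 5, ...
    even = sum(item_marks[1::2])  # items 2, 4, 6, ...
    return odd, even

def spearman_brown(r):
    """Boost the correlation between two halves to full-test reliability."""
    return 2 * r / (1 + r)

print(split_half_scores([1, 0, 1, 1, 0, 1]))
print(round(spearman_brown(0.7), 2))
```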

The third method of measuring reliability is to use
the internal consistency of the test, by which each
item is considered as assisting towards the final
overall order of merit to a degree dependent on its
own variance. There are two possible formulae,
based on the work of Kuder and Richardson in this
field (27). The first needs the value of the facility, F,
of each item to be evaluated, as in the section on
‘Item Analysis’, and the standard deviation, $s_x$, also
needs to be known. Then the first formula is:

$$\text{Reliability} = r_{tt} = \frac{n}{n-1}\left(1 - \frac{\sum F(1-F)}{s_x^2}\right) \qquad (4)$$

($n$ = number of items in the test).

In practice, if the values of F are all almost equal,
then the above formula can be simplified. If, for
example, F = 0.50, (1 - F) = 0.50 and for this item
F(1 - F) = 0.25.

If F is either 0.40 or 0.60, then F(1 - F) = 0.24.
Hence, if F for all items lies between 0.40 and 0.60,
the error in assuming that all values of F are equal
is slight.

But $\sum F = \bar{X}$, the average mark, hence (4) can be
changed to

$$r_{tt} = \frac{n}{n-1}\left(1 - \frac{\bar{X}(n - \bar{X})}{n s_x^2}\right) \qquad (4a)$$

This formula only needs the average mark, $\bar{X}$, and
the standard deviation, $s_x$, to be evaluated, and this
is automatically done at the first stage of calculations
on test scores. The formula (4a) will always give a
rather low value of reliability, but it is rarely more
than 0.03 below that given by formula (4). For
class-room use it is quite adequate.

Example 4: Suppose a 50-item test has facilities
for ten items each of 0.20, 0.30, 0.40, 0.60, 0.70, and
a standard deviation of eight marks.

Using formula (4):

F       1-F     F(1-F)
0.20    0.80    0.16
0.30    0.70    0.21
0.40    0.60    0.24
0.60    0.40    0.24
0.70    0.30    0.21

          Total = 1.06

Multiply by 10 (10 items each): $\sum F(1-F) = 10.6$

$$r_{tt} = \frac{50}{49}\left(1 - \frac{10.6}{64}\right) = \frac{50}{49} \times \frac{53.4}{64} = 0.851$$

Had it been assumed that all facilities were equal
and that the average mark had been 22, say,
formula (4a) gives:

$$r_{tt} = \frac{50}{49}\left(1 - \frac{22 \times 28}{50 \times 64}\right) = \frac{50}{49}\left(1 - \frac{616}{50 \times 64}\right) = \frac{50}{49} \times \frac{323}{400} = 0.824$$

The difference is thus 0.851 - 0.824 = 0.027, which
is a fairly typical result.
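
Formulae (4) and (4a) can be sketched in Python with the data of Example 4. The function names are illustrative only. Carried to three decimal places, the two formulae evaluate to 0.851 and 0.824 respectively.

```python
def reliability_formula_4(facilities, sd):
    """Formula (4): n/(n-1) * (1 - sum(F(1-F)) / sd^2)."""
    n = len(facilities)
    item_variances = sum(f * (1 - f) for f in facilities)
    return n / (n - 1) * (1 - item_variances / sd ** 2)

def reliability_formula_4a(n_items, mean, sd):
    """Formula (4a): n/(n-1) * (1 - mean(n - mean) / (n * sd^2))."""
    return n_items / (n_items - 1) * (
        1 - mean * (n_items - mean) / (n_items * sd ** 2)
    )

# Ten items at each of the five facilities, 50 items in all.
facilities = [0.20, 0.30, 0.40, 0.60, 0.70] * 10

# The sum of the facilities is the average mark used in formula (4a).
average_mark = sum(facilities)

print(round(average_mark, 2))
print(round(reliability_formula_4(facilities, 8), 3))
print(round(reliability_formula_4a(50, 22, 8), 3))
```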


8. The standard error of a test score 


This expression often puzzles teachers when they 
hear it expressed since, although it is widely recog- 
nised that in the marking of essay questions there is 
quite likely to be some discrepancy between markers, 
the results of an objective test are often regarded as 
exact. This confusion arises from an erroneous 
viewpoint. Essay marks contain two kinds of 
possible error: error due to differences between 
markers; and error due to unreliability of the paper 
itself. The candidates’ daily error for ‘good’ or ‘bad’ 
days is present always. In an objective test, the first 
kind of error is eliminated because there is an exact 
mark not subject to the interpretation of different 
examiners. The second kind of error associated with 
unreliability is still present, but invariably much 
lower than in any essay mark because the test 
reliability is higher than essay reliability. If reliability 
is thought of as the ability of a test to reproduce the 
same score on a second trial, then because a test is, 
say, only of 90% reliability, the 10% error includes 
the different score levels achieved by at least some 
of the candidates on a second attempt. 

If $r_{tt}$ is the reliability of a test, then $(1 - r_{tt})$ could
be called the unreliability. These score differences
from one trial to the second are themselves distributed
in a similar way to the actual scores, and
they too, therefore, have a standard deviation of the
‘error’ scores. This quantity is called the standard
error of a mark (or score) and is defined by:

$$e_x = s_x\sqrt{1 - r_{tt}} \qquad (5)$$

Note that if $r_{tt} = 1$, $e_x = 0$, i.e. a perfectly reliable test,
because it repeats scores without error, has no
standard error. Equally if $r_{tt} = 0$, a completely
unreliable test, then $e_x = s_x$, i.e. all of the spread of
marks on the test is due to error, and none to the test
itself. In practice, for a reasonably reliable test, say
$r_{tt} = 0.90$ or more, $e_x$ is around 6 marks in a 100.
Accepting the conventional level of two standard 
deviations above and below a mean which includes 
95% of all cases, implies that there needs to be about 
12 marks in a 100 between two candidates before it 
can be asserted that the higher score is really 
indicative of higher ability. In other words, if two 
candidates on such a test scored respectively 38 and 
40 marks, it would be wrong to assume, without 
further evidence, that one candidate was better than 
the other. On a retest the same two candidates could
well score 41 and 35 respectively. However, if the
mark gap between them were 12% or more, it is 
probable that this order would be repeated on retest, 
i.e. one is likely to be better than the other. 

This appears to contradict the accuracy and 
objectivity of the marking on a test. The mark is 
exact; however it is not necessarily correct, now or 
for all time. If a similar statistic is evaluated for 
essay questions, which is rather more difficult 
because of the lack of precise marking in any case, 
empirical results have often shown reliabilities as
low as 0.2 or 0.3. Then the standard error is of the
order of 12 marks in 100, virtually double that for
an objective test. Thus, the value of $2e_x$ is about 24
in 100. This implies that really there are only the
grades corresponding to 0-24 (average 12), 25-49 
(average 36), 50-74 (average 60) and 75-99 (average 
84) which can genuinely be separated. Yet many 
examiners study a script and solemnly change 36% 
to 37% and so on. A very reliable objective test can 
achieve 11 grade-bands in 100 items, although 
because the score is precise, this score is taken in the 
absence of better data. An example of the use of 
logical grades based on test scores is: 


Average: 44 out of 100, say; standard deviation = 14;
reliability = 0.86

$$e_x = 14\sqrt{1 - 0.86} = 5.24$$

$$2e_x = 10.48$$

44 ± 10.48 = 54 and 34
44 ± 2 × 10.48 = 65 and 23
44 ± 3 × 10.48 = 75 and 13
44 ± 4 × 10.48 = 86 and 0
44 + 5 × 10.48 = 96

(using whole numbers only).

Thus, in effect, the mark grades on this test are:
0, 13, 23, 34, 44, 54, 65, 75, 86, 96, and these could
be the lower bounds of grades classified as 1 to 10.
This gives a realistic appreciation of the effective
grading possible from a single test. The combination
of several tests leads, of course, to a higher overall
reliability, thus $e_x$ is reduced and many more grade
categories can be distinguished.
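
The standard error and the first few grade boundaries can be sketched in Python. The function names are hypothetical. The sketch uses the unrounded value of twice the standard error, and it does not reproduce the treatment of the lowest band, which the example extends down to 0.

```python
import math

def standard_error(sd, reliability):
    """Formula (5): e_x = s_x * sqrt(1 - r_tt)."""
    return sd * math.sqrt(1 - reliability)

def band_marks(mean, step, k):
    """Whole-number marks k steps above and below the mean."""
    return round(mean + k * step), round(mean - k * step)

def clearly_different(mark_a, mark_b, e_x):
    """True when two marks are at least two standard errors apart."""
    return abs(mark_a - mark_b) >= 2 * e_x

e_x = standard_error(14, 0.86)
step = 2 * e_x
print(round(e_x, 2), round(step, 2))
for k in (1, 2, 3):
    print(band_marks(44, step, k))

# With a standard error of about 6 marks, 38 and 40 cannot be separated.
print(clearly_different(38, 40, 6), clearly_different(38, 50, 6))
```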


9. Correlation between test scores 

This is needed, for example, in the split-half method
of measuring reliability. In effect we have two scores
for each candidate, for convenience labelled the
x-score and the y-score. Each may be a score on the
separated halves of the same test, or they may be
scores on tests in different subjects.

In each case we would calculate the average marks,
which can be denoted by $\bar{x}$ and $\bar{y}$, and the standard
deviations, which can be denoted by $s_x$ and $s_y$.

The new term we need is called the covariance of
the two scores, and is a measure of the way in which
the scores are related. To obtain this we first need
to multiply together the pair of scores for each
candidate and add the products together.

As a formula:

$$\text{Covariance} = s_{xy} = \frac{\sum xy}{N} - \bar{x}\bar{y} \qquad (6)$$
For a very large number of candidates there are
short-cut methods of obtaining this quantity which
are found in most standard statistical texts (one, for
example, is by ‘diagonal adding’), but for smaller
numbers, the product sum ($\sum xy$) is not difficult to
calculate, especially if a small calculating machine is
available. As the formula shows, we then divide the
product sum by the number of candidates and
subtract the product of the two average marks.
The product-moment correlation coefficient, $r_{xy}$,
is then given by:

$$r_{xy} = \frac{s_{xy}}{s_x s_y} \qquad (7)$$

and varies from -1, through 0 to +1.
If $r_{xy} = 0$, then it can be assumed that there is no
relation between the x-score and the y-score for the
group of candidates. If $r_{xy}$ approaches +1, then the
relationship is stronger, reaching perfect agreement
only in the improbable situation of $r_{xy} = 1$.

Similarly if $r_{xy} = -1$, the relationship is again
perfect, but opposite, so that a large x-score
corresponds to a small y-score, and vice versa. This
is of course uncommon in educational work where
good performance in one subject tends to go with at
least acceptable performance in others, linked as
many subjects are, by the underlying general
intelligence of the candidates.

In the split-half reliability method, if the correlation
between the two halves were, for example,
$r_{xy} = 0.7$, then the boosted reliability for the combined
halves as a single test would be:

$$r_{tt} = \frac{2r_{xy}}{1 + r_{xy}} = \frac{2 \times 0.7}{1 + 0.7} = \frac{1.4}{1.7} = 0.82$$

(Spearman-Brown formula).


EXAMPLE ON CORRELATION
This is not a realistic example, as the number of
scores is too few, but illustrates only the steps
needed for the calculation.

Scores

Candidates     x      y      x²     y²     xy
1 to 10        …      …      …      …      …
Total          65     85     535    811    638

$$\bar{x} = \frac{65}{10} = 6.5 \qquad \bar{y} = \frac{85}{10} = 8.5$$

$$s_x^2 = \frac{10 \times 535 - 65^2}{90} = 12.5 \qquad s_x = 3.5355$$

$$s_y^2 = \frac{10 \times 811 - 85^2}{90} = 9.8333 \qquad s_y = 3.1358$$

$$s_{xy} = \frac{638}{10} - (6.5 \times 8.5) = 63.8 - 55.25 = 8.55$$

Hence

$$r_{xy} = \frac{8.55}{3.1358 \times 3.5355} = 0.77$$

not very high for only 10 sets of scores.
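
The calculation can be sketched in Python directly from the column totals of the example, following formulae (1), (6) and (7) as printed. The function names are illustrative. Because formula (6) divides by N while formula (1) uses N - 1, the result is slightly lower than the coefficient obtained when the same divisor is used throughout.

```python
import math

def variance_from_sums(n, sum_v, sum_v2):
    """Formula (1) applied to the totals of one column of scores."""
    return (n * sum_v2 - sum_v ** 2) / (n * (n - 1))

def covariance_from_sums(n, sum_x, sum_y, sum_xy):
    """Formula (6): product sum over N, less the product of the means."""
    return sum_xy / n - (sum_x / n) * (sum_y / n)

def correlation_from_sums(n, sum_x, sum_y, sum_x2, sum_y2, sum_xy):
    """Formula (7): covariance over the product of standard deviations."""
    s_x = math.sqrt(variance_from_sums(n, sum_x, sum_x2))
    s_y = math.sqrt(variance_from_sums(n, sum_y, sum_y2))
    return covariance_from_sums(n, sum_x, sum_y, sum_xy) / (s_x * s_y)

# Totals from the example: N = 10, sums of x, y, x^2, y^2 and xy.
r_xy = correlation_from_sums(10, 65, 85, 535, 811, 638)
print(round(variance_from_sums(10, 65, 535), 4))
print(round(variance_from_sums(10, 85, 811), 4))
print(round(covariance_from_sums(10, 65, 85, 638), 2))
print(round(r_xy, 2))
```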


Selected Bibliography

General books on Objective Tests or Statistical Analysis

1. ADKINS, D. D. et al.
   Construction and Analysis of Achievement Tests.
   U.S. Govt. Printing Office, 1947.

2. BLOOM, B. S. (Editor)
   Taxonomy of Educational Objectives.
   Handbook 1: The Cognitive Domain.
   Longmans, Green & Co., 1956.

3. DAVIS, F. B.
   Item-analysis Data: Their Computation, Interpretation
   and Use in Test Construction.
   Harvard Univ. Press, Cambridge (Mass.), 1946.

4. FERGUSON, G. A.
   The Reliability of Mental Tests.
   Univ. of London Press, 1940.

5. GARRETT, H. E.
   Statistics in Psychology and Education.
   Longmans, Green & Co. 3rd ed., 1947.

6. GUILFORD, J. P.
   Psychometric Methods.
   McGraw-Hill Publishing Co. Ltd. 2nd ed., 1954.

7. GULLIKSEN, H.
   Theory of Mental Tests.
   Chapman & Hall Ltd., London 2nd printing, 1958.

8. LINDQUIST, E. F. (Editor)
   Educational Measurement.
   American Council on Education, Washington, 1951.

9. MCCLELLAND, W.
   Selection for Secondary Education.
   Pub. XIX of the Scottish Council for Research
   in Education. Univ. of London Press, 1942.

10. MCINTOSH, D. M. et al.
    The Scaling of Teachers’ Marks and Estimates.
    Oliver and Boyd, 1962.

11. SNEDECOR, G. W.
    Statistical Methods.
    4th ed. Iowa State College Press, 1946.

12. THORNDIKE, R. L.
    Personnel Selection (Test and Measurement Techniques).
    Chapman & Hall, 1949.

13. VERNON, P. E.
    An Introduction to Objective-type Examinations,
    Examinations Bulletin No. 4.
    H.M.S.O., 1964.

14. WALKER, H. M. & LEV, J.
    Statistical Inference.
    Holt, Rinehart and Winston (New York), 1953.

15. YATES, A. & PIDGEON, D.
    Admission to Grammar Schools.
    National Foundation for Educational Research
    in England and Wales. Publication No. 10.
    Newnes Educational Pub. Co., 1957.

Books or Reports of General Examination Interest

16. GRIEVE, D. W.
English Language Examining.
African Univ. Press, for West African Examinations Council, 1964.

17. HARTOG, P. & RHODES, E. C.
An Examination of Examinations.
Macmillan & Co. Ltd., 1935.

18. JEFFREY, G. B. (Editor)
The Effects of External Examinations on the School.
Harrap & Co. Ltd., 1958.

19. VALENTINE, C. W.
The Reliability of Examinations.
Univ. of London Press, 1932.


Research and Report Papers in Journals

20. BROWNLEES, V. T. & KEATS, J. A.
A retest method of studying partial knowledge and other factors influencing item response.
Psychometrika, March 1958, pp. 67-73.

21. CRONBACH & WARRINGTON
Time limit tests and estimating their reliability and degree of speeding.
Psychometrika, 1951, pp. 167-188.


22. DANIELS, J. C.
Testing geography at the 'O' level of the G.C.E.
British Journal of Educational Psychology, 1954, pp. 180-189.

23. DEMPSTER, J. J.
Group intelligence tests: An inquiry into the effects of coaching and practice.
The Schoolmaster, Jan. 3rd, 1952.

24. FLANAGAN, J. C.
General considerations in the selection of test items and a short method of estimating the product-moment correlation from the tails of the distribution.
Journal of Educational Psychology, 30 (1939), pp. 674-680.

25. HEIM & WATTS
Practice and coaching in tests.
Brit. Journal of Educational Psychology, 1957.

26. KELLEY, T. L.
The selection of upper and lower groups for the validation of test items.
Journal of Educational Psychology, 30 (1939), pp. 17-24.

27. KUDER, G. F. & RICHARDSON, M. W.
The theory of the estimation of test reliability.
Psychometrika, 2 (1937), pp. 151-160.


28. LORD, F. M.
The relation of reliability of multiple-choice tests to the distribution of item difficulties.
Psychometrika, 1952, pp. 181-194.

29. WISEMAN, S. & WRIGLEY, J.
The comparative effects of coaching and practice on the results of intelligence tests.
Brit. Journal of Psychology, XLIV, 2, May 1953.

30. WRIGLEY, J.
The factorial nature of ability in elementary mathematics.
Brit. Journal of Educational Psychology, 1957.

31. YATES, A. et al.
Symposium on the effects of coaching and practice in intelligence tests (5 articles).
Brit. Journal of Educational Psychology, 1, 2 (1953); 3, 4, 5 (1954).
    1. YATES, A.
       An analysis of some recent investigations.
    2. JAMES, W. S.
       Coaching for all recommended.
    3. DEMPSTER, J. J.
       Southampton investigation and procedure.
    4. WISEMAN, S.
       The Manchester experiment.
    5. VERNON, P. E.
       Conclusions.




Appendix 1 
Specimen Answer Cards 


Type 1: 

For use in class-tests, where each pupil selects a lettered answer option and writes his chosen letter in the space 
provided under each item number in turn. The word ‘question’ is preferred to ‘item’ as being more familiar 
to pupils. 


ANSWER CARD FOR OBJECTIVE TESTS

Class ........................    Subject of test ........................

Pupil's name ........................

Write the answer letter you choose in the space provided under the question number.

Question No.  |  1 |  2 |  3 |  4 |  5 |  6 |  7 |  8 |  9 | 10
Answer Letter |    |    |    |    |    |    |    |    |    |

[The card continues with further rows of Question No. and Answer Letter boxes.]


Type 2: 
For examination use, when special scoring equipment is available. 


ANSWER CARD FOR OBJECTIVE TESTS

[label illegible] ........................    Subject of test ........................

Examination Index No. ........................

Candidate's name ........................

Show your choice of answer for each question by shading in pencil the dotted lines to the right of the number of the answer choice.




Appendix 2: STATISTICAL TABLES
Table 1: Double-tetrachoric correlation coefficients 


WN Wl AA AG a S 


SL Sh 56 5$ €0 


04|10 20 25/30 35 40 45 50/53 56 58 61 63166 ga 70 71 73/74 76 77 78 80|81 82 83 84 d 
06| 0 09 14/20 25 30 35 40|43 47 50 53 56] 59 61 63 65 66|68 70 72 73 75177 78 80 8l = 
08 0 07/13 18 23 28 32/36 40 43 47 50|53 ss 58 60 62/63 65 67 68 70/72 73 75 RES 
10 0/05 10 15 20 25/30 33 37 40 43|46 49 si 54 56|58 60 62 64 66|68 69 71 73 74 
12| 0 05 10 15 20|24 27 30 33 37140 43 46 49 si 54 56 59 61 63/65 66 68 2 5 
14 0 05 10 14/18 22 26 30 33/37 40 43 45 4i 50 52 54 57 so|6l 63 65 66 es 
16 0 04 0812 16 20 24 28/31 34 37 4p 4j 45 48 50 52 54|57 59 61 63 e 
18 © 04108 12 16 20 23|27 30 33 36 30/4, 44 46 49 51|53 56 58 60 
20 004 08 12 16 20|23 27 30 33 36] 39 41 43 46 48 57 
22 0 04 08 12 15 18 22 25 28 31|34 37 40 42 44 2 
24 9 04 08 12/15 18 21 24 37/39 33 36 39 41 30 
26 9 93 07|10 48 17 20 23|26 So. 32 35 38|41 43 45 48 50 
28 9 C501 30 13 17 2023 6 o9 31 sala 2) 42 45 a 
30 9|03 07 10 13 16/19 21 24 27 30|33 37 40 42 
30 [—— ——Á5)-0 10 13 16 19 21 24 27 30,33 37 40 42 45 
32 9 03 07 10 13/16 19 22 2s lay 33 36 38 i 
34 9 03 07 10|13 15 18 20 23/27 30 33 36 A 
36 9 03 06|09 11 14 17 20123 27 30 33 25 
38 9 03/06 09 11 14 17|20 23 27 30 a 
40 K 0/03 06 09 12 15|18 21 24 27 3 
—— Im 06 09 12 15/18 21 24 27 30 
32 0 03 06 09 11|14 17 20 23 2 
44 9 03 06 09|12 15 18 21 2 
46 0 03 06|09 11 14 17 20 
48 0 03/06 09 11 14 17 
2 [| CAES e rera d 0|03 06 09 11 14 
52 0 03 07 09 11 
54 0 03 06 09 
56 0 03 07 
58 0 03 
60 0 


The greater proportion is read along the top scale and the lesser proportion on the vertical scale. Where the table is blank because the correlation exceeds 0.95, record the value as 0.95. If the greater proportion exceeds 0.96, or the lesser falls below 0.04, the value is indeterminate and only a single reading can be obtained. This will rarely occur unless the item falls well outside acceptable limits. Each entry is read as a decimal, e.g. 37 is read as 0.37.


Examples of use

(i) even pairs are read directly
e.g. 0.76 and 0.64
Intersection gives 0.20 at once.

(ii) one odd number
e.g. 0.77 and 0.64
The two adjacent readings are averaged: (0.20 + 0.28)/2 = 0.24
Final value: 0.24

(iii) two odd numbers
e.g. 0.77 and 0.63
One cross pair is 0.24 in each case.
The average of the other cross pair is (0.20 + 0.28)/2 = 0.24 also.
Final value: 0.24

(iv) cross pairs differing by 0.01
e.g. 0.79 and 0.65
(0.24 + 0.25)/2 = 0.245
Ignore 0.005 in the latter.
Final value: 0.24
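
The averaging used in these examples can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `average_entries` is hypothetical, the entries are taken as the two-digit whole numbers printed in the table (so 37 stands for 0.37), and the instruction to ignore 0.005 is read as rounding the average down to two decimal places.

```python
def average_entries(*entries: int) -> float:
    """Average two-digit table entries and return the result as a decimal.

    Entries are whole numbers as printed (20 means 0.20). Integer division
    drops any odd half, which is the 0.005 the examples say to ignore.
    """
    if not entries:
        raise ValueError("at least one table entry is needed")
    return (sum(entries) // len(entries)) / 100


# Readings 0.20 and 0.28 average to 0.24.
print(average_entries(20, 28))
# Readings 0.24 and 0.25 average to 0.245, recorded as 0.24.
print(average_entries(24, 25))
```

Working in whole hundredths avoids floating-point rounding surprises when the half is dropped.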


Table 2: Values of $\sqrt{1-r}$, where
r = reliability of an objective test
Direct reading for r: 0.01-0.75, in 0.01 steps
                      0.750-0.995, in 0.005 steps
                      0.930-0.993, in 0.001 steps
                      0.9930-0.9995, in 0.0005 steps

If the reliability of an objective test is r, then e, the standard error of a mark, is given by $e = s\sqrt{1-r}$ (Formula (5): Part 6), where s is the standard deviation of the marks. The table enables $\sqrt{1-r}$ to be read off to four decimal places for a range of r from 0 to 1, so that the product $s\sqrt{1-r}$ is a simple multiplication. The value of e, however, should normally be recorded to two decimal places only. As r approaches 1, $\sqrt{1-r}$ changes rapidly, hence the tabulation is made in smaller steps. First differences are shown for interpolation between tabular values when necessary. Examples are given below. A decimal point should be read in front of all values of r and $\sqrt{1-r}$, and differences are in terms of the 3rd and 4th decimal places of $\sqrt{1-r}$.
All first differences, Δ, are negative.


Examples

Direct reading:

r       | $\sqrt{1-r}$
0.07    | 0.9643
0.765   | 0.4848
0.965   | 0.1871
0.9985  | 0.0387

Interpolation with first differences (which are all negative):

(i) r = 0.637. For r = 0.63, $\sqrt{1-r}$ = 0.6083, Δ = -0.0083
(7/10) × 0.0083 = 0.0058 (tabular steps in 0.01)
Required value of $\sqrt{1-r}$ = 0.6083 - 0.0058 = 0.6025

(ii) r = 0.832. For r = 0.830, $\sqrt{1-r}$ = 0.4123, Δ = -0.0061
(2/5) × 0.0061 = 0.0024(4) (the tabular values here are in steps of 0.005)
Required value of $\sqrt{1-r}$ = 0.4123 - 0.0024 = 0.4099
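
Where a calculating aid is to hand, the same quantities can be checked directly. The following Python sketch is offered only as an illustration: the function names and the sample figures (a standard deviation of 12 marks and a reliability of 0.91) are assumptions made for the example and do not come from the handbook.

```python
import math


def sqrt_one_minus_r(r: float) -> float:
    """Compute the tabulated quantity, the square root of (1 - r), directly."""
    if not 0.0 <= r <= 1.0:
        raise ValueError("reliability r must lie between 0 and 1")
    return math.sqrt(1.0 - r)


def interpolate(r: float, r_low: float, step: float,
                tabulated: float, delta: float) -> float:
    """Interpolate between tabular values using a first difference.

    r_low is the tabular r just below the required r, step is the tabular
    interval there, tabulated is the table value at r_low, and delta is
    the (negative) first difference for one whole step.
    """
    fraction = (r - r_low) / step
    return tabulated + fraction * delta


def standard_error(s: float, r: float) -> float:
    """Standard error of a mark, s times sqrt(1 - r), to two decimal places."""
    return round(s * sqrt_one_minus_r(r), 2)


# Direct reading for r = 0.765, to four decimal places.
print(round(sqrt_one_minus_r(0.765), 4))
# Interpolation for r = 0.637 from the entry at r = 0.63.
print(round(interpolate(0.637, 0.63, 0.01, 0.6083, -0.0083), 4))
# Interpolation for r = 0.832 from the entry at r = 0.830.
print(round(interpolate(0.832, 0.830, 0.005, 0.4123, -0.0061), 4))
# Hypothetical test: standard deviation 12 marks, reliability 0.91.
print(standard_error(12, 0.91))
```

The interpolated figures agree with direct computation to four decimal places, which is why the first differences suffice for the precision the table offers.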


r  | $\sqrt{1-r}$ | Δ || r  | $\sqrt{1-r}$ | Δ || r  | $\sqrt{1-r}$ | Δ || r   | $\sqrt{1-r}$ | Δ || r   | $\sqrt{1-r}$ | Δ
01 | 9950 | 50 || 26 | 8602 | 58 || 51 | 7000 | 72 || 750 | 5000 | 50 || 875 | 3536 | 72
02 | 9900 | 51 || 27 | 8544 | 59 || 52 | 6928 | 72 || 755 | 4950 | 51 || 880 | 3464 | 73
03 | 9849 | 51 || 28 | 8485 | 59 || 53 | 6856 | 74 || 760 | 4899 | 51 || 885 | 3391 | 74
04 | 9798 | 51 || 29 | 8426 | 59 || 54 | 6782 | 74 || 765 | 4848 | 52 || 890 | 3317 | 77
05 | 9747 | 52 || 30 | 8367 | 60 || 55 | 6708 | 75 || 770 | 4796 | 53 || 895 | 3240 | 78
06 | 9695 | 52 || 31 | 8307 | 61 || 56 | 6633 | 76 || 775 | 4743 | 53 || 900 | 3162 | 80
07 | 9643 | 52 || 32 | 8246 | 61 || 57 | 6557 | 76 || 780 | 4690 | 53 || 905 | 3082 | 82
08 | 9591 | 52 || 33 | 8185 | 61 || 58 | 6481 | 78 || 785 | 4637 | 54 || 910 | 3000 | 85
09 | 9539 | 52 || 34 | 8124 | 62 || 59 | 6403 | 78 || 790 | 4583 | 55 || 915 | 2915 | 87
10 | 9487 | 53 || 35 | 8062 | 62 || 60 | 6325 | 80 || 795 | 4528 | 56 || 920 | 2828 | 89
11 | 9434 | 53 || 36 | 8000 | 63 || 61 | 6245 | 81 || 800 | 4472 | 56 || 925 | 2739 | 93
12 | 9381 | 54 || 37 | 7937 | 63 || 62 | 6164 | 81 || 805 | 4416 | 57 || 930 | 2646 | 96
13 | 9327 | 54 || 38 | 7874 | 64 || 63 | 6083 | 83 || 810 | 4359 | 58 || 935 | 2550 | -
14 | 9273 | 54 || 39 | 7810 | 64 || 64 | 6000 | 84 || 815 | 4301 | 58 || 940 | 2449 | -
15 | 9219 | 55 || 40 | 7746 | 65 || 65 | 5916 | 85 || 820 | 4243 | 60 || 945 | 2345 | -
16 | 9165 | 55 || 41 | 7681 | 65 || 66 | 5831 | 86 || 825 | 4183 | 60 || 950 | 2236 | -
17 | 9110 | 55 || 42 | 7616 | 66 || 67 | 5745 | 88 || 830 | 4123 | 61 || 955 | 2121 | -
18 | 9055 | 55 || 43 | 7550 | 67 || 68 | 5657 | 89 || 835 | 4062 | 62 || 960 | 2000 | -
19 | 9000 | 56 || 44 | 7483 | 67 || 69 | 5568 | 91 || 840 | 4000 | 63 || 965 | 1871 | -
20 | 8944 | 56 || 45 | 7416 | 68 || 70 | 5477 | 92 || 845 | 3937 | 64 || 970 | 1732 | -
21 | 8888 | 56 || 46 | 7348 | 68 || 71 | 5385 | 93 || 850 | 3873 | 65 || 975 | 1581 | -
22 | 8832 | 57 || 47 | 7280 | 69 || 72 | 5292 | 96 || 855 | 3808 | 66 || 980 | 1414 | -
23 | 8775 | 57 || 48 | 7211 | 70 || 73 | 5196 | 97 || 860 | 3742 | 68 || 985 | 1225 | -
24 | 8718 | 58 || 49 | 7141 | 70 || 74 | 5099 | 99 || 865 | 3674 | 68 || 990 | 1000 | -
25 | 8660 | 58 || 50 | 7071 | 71 || 75 | 5000 | -  || 870 | 3606 | 70 || 995 | 0707 | -

r   | $\sqrt{1-r}$ || r   | $\sqrt{1-r}$ || r    | $\sqrt{1-r}$
930 | 2646 || 956 | 2098 || 981  | 1378
931 | 2627 || 957 | 2074 || 982  | 1342
932 | 2608 || 958 | 2049 || 983  | 1304
933 | 2588 || 959 | 2025 || 984  | 1265
934 | 2569 || 960 | 2000 || 985  | 1225
935 | 2550 || 961 | 1975 || 986  | 1183
936 | 2530 || 962 | 1949 || 987  | 1140
937 | 2510 || 963 | 1924 || 988  | 1095
938 | 2490 || 964 | 1897 || 989  | 1049
939 | 2470 || 965 | 1871 || 9900 | 1000
940 | 2449 || 966 | 1844 || 9910 | 0949
941 | 2429 || 967 | 1817 || 9920 | 0894
942 | 2408 || 968 | 1789 || 9930 | 0837
943 | 2387 || 969 | 1761 || 9935 | 0806
944 | 2366 || 970 | 1732 || 9940 | 0775
945 | 2345 || 971 | 1703 || 9945 | 0742
946 | 2324 || 972 | 1673 || 9950 | 0707
947 | 2302 || 973 | 1643 || 9955 | 0671
948 | 2280 || 974 | 1612 || 9960 | 0632
949 | 2258 || 975 | 1581 || 9965 | 0592
950 | 2236 || 976 | 1549 || 9970 | 0548
951 | 2214 || 977 | 1517 || 9975 | 0500
952 | 2191 || 978 | 1483 || 9980 | 0447
953 | 2168 || 979 | 1449 || 9985 | 0387
954 | 2145 || 980 | 1414 || 9990 | 0316
955 | 2121 || 981 | 1378 || 9995 | 0224




A decimal point is to be inserted before all values of r and $\sqrt{1-r}$, and two zeros are to be read before all Δ values, i.e. Δ = 72 is read as 0.0072.


Table 3: The normal curve tabulated
in percentile areas


This table is presented in this form since it is the most useful for examination result analysis. Column 1 shows the percentage, P, of the total area of the normal curve above the (x/s) values recorded in column 2. Since the curve is symmetrical, for values of P up to 50%, the deviation (x/s) is positive; for values of P over 50%, the deviation is negative.

For example, if P = 63%, (x/s) = -0.332
and if P = 37%, (x/s) = +0.332


[Graph to illustrate Table 3 in Appendix 2: for P less than 50%, (x/s) is positive; for P greater than 50%, (x/s) is negative.]


Examples of the use of the table

(i) The top 14% of the marks (or of the candidates)

For P = 14%, (x/s) = 1.080

Hence, if $\bar{x}$ is the average mark of the test, and s is its standard deviation of marks, then the lowest candidate in the top 14% should score $\bar{x} + 1.080s$.

(ii) The top 86% of the marks (or of the candidates)

For P = 86%, (x/s) = -1.080

Hence, the lowest candidate in the top 86% should score $\bar{x} - 1.080s$.


Separating the top 86% of candidates separates, at the same time, the bottom 14% of candidates. Hence the identity of the quantity 1.080 in each case, and the top mark in the bottom 14% is roughly equivalent to the bottom mark in the top 86%. These candidates are in effect next to each other in the order of merit, although with large numbers there could, of course, be several candidates at each mark level. These mark ranges apply to a perfectly normal distribution. In practice, as distributions are not usually exactly normal, especially with small numbers of candidates, these tables are used to give approximate results. For large-scale examinations, they are adequate for predicting cut-off points to achieve given percentages in various categories of pass, credit, or distinction.
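
The cut-off calculation in these examples can also be sketched in Python, using the standard library's `statistics.NormalDist` in place of the printed table. This is a minimal sketch only: the function names, and the sample test with a mean of 50 marks and a standard deviation of 10, are assumptions made for illustration and are not figures from the handbook.

```python
from statistics import NormalDist


def deviation_for_top_percent(p: float) -> float:
    """Return the deviation (x/s) above which p per cent of a normal curve lies.

    The value is positive for p below 50 and negative for p above 50.
    """
    if not 0 < p < 100:
        raise ValueError("p must lie strictly between 0 and 100")
    return NormalDist().inv_cdf(1 - p / 100)


def cut_off_mark(mean: float, sd: float, p: float) -> float:
    """Mark of the lowest candidate in the top p per cent, assuming normality."""
    return mean + deviation_for_top_percent(p) * sd


# Deviation for the top 14%, to three decimal places.
print(round(deviation_for_top_percent(14), 3))
# Deviation for the top 86%, equal in size and opposite in sign.
print(round(deviation_for_top_percent(86), 3))
# Hypothetical test: mean 50, standard deviation 10, top 14% of candidates.
print(round(cut_off_mark(50, 10, 14), 1))
```

As with the printed table, the result is only as good as the assumption that the marks are normally distributed.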


POSITIVE VALUES (P up to 50%)

%P   | (x/s) || %P   | (x/s) || %P | (x/s)  || %P | (x/s) || %P | (x/s)
0.1  | 3.090 || 4.0  | 1.751 || 11 | 1.227  || 26 | 0.643 || 41 | 0.227
0.2  | 2.878 || 4.5  | 1.696 || 12 | 1.175  || 27 | 0.613 || 42 | 0.202
0.3  | 2.748 || 5.0  | 1.645 || 13 | 1.126  || 28 | 0.583 || 43 | 0.1764
0.4  | 2.652 || 5.5  | 1.598 || 14 | 1.080  || 29 | 0.553 || 44 | 0.1510
0.5  | 2.576 || 6.0  | 1.555 || 15 | 1.037  || 30 | 0.524 || 45 | 0.1256
0.6  | 2.512 || 6.5  | 1.514 || 16 | 0.995  || 31 | 0.496 || 46 | 0.1005
0.7  | 2.457 || 7.0  | 1.476 || 17 | 0.954  || 32 | 0.468 || 47 | 0.0754
0.8  | 2.409 || 7.5  | 1.439 || 18 | 0.915  || 33 | 0.440 || 48 | 0.0503
0.9  | 2.366 || 8.0  | 1.405 || 19 | 0.878  || 34 | 0.412 || 49 | 0.0250
1.0  | 2.326 || 8.5  | 1.372 || 20 | 0.842  || 35 | 0.385 || 50 | 0.0
1.5  | 2.170 || 9.0  | 1.341 || 21 | 0.807  || 36 | 0.358 ||
2.0  | 2.054 || 9.5  | 1.311 || 22 | 0.772  || 37 | 0.332 ||
2.5  | 1.960 || 10.0 | 1.282 || 23 | 0.739  || 38 | 0.306 ||
3.0  | 1.881 ||      |       || 24 | 0.706  || 39 | 0.279 ||
3.5  | 1.811 ||      |       || 25 | 0.6745 || 40 | 0.253 ||

NEGATIVE VALUES (P over 50%)

%P | (x/s)   || %P | (x/s)   || %P | (x/s)  || %P | (x/s)
51 | -0.0250 || 64 | -0.358  || 77 | -0.739 || 90 | -1.282
52 | -0.0503 || 65 | -0.385  || 78 | -0.772 || 91 | -1.341
53 | -0.0754 || 66 | -0.412  || 79 | -0.807 || 92 | -1.405
54 | -0.1005 || 67 | -0.440  || 80 | -0.842 || 93 | -1.476
55 | -0.1256 || 68 | -0.468  || 81 | -0.878 || 94 | -1.555
56 | -0.1510 || 69 | -0.496  || 82 | -0.915 || 95 | -1.645
57 | -0.1764 || 70 | -0.524  || 83 | -0.954 || 96 | -1.751
58 | -0.202  || 71 | -0.553  || 84 | -0.995 || 97 | -1.881
59 | -0.227  || 72 | -0.583  || 85 | -1.037 || 98 | -2.054
60 | -0.253  || 73 | -0.613  || 86 | -1.080 || 99 | -2.326
61 | -0.279  || 74 | -0.643  || 87 | -1.126 ||
62 | -0.306  || 75 | -0.6745 || 88 | -1.175 ||
63 | -0.332  || 76 | -0.706  || 89 | -1.227 ||


Values from 99.1 to 99.9 can be read at the top of the table in the range 0.1 to 0.9, in reverse.




Longmans 


This handbook, the product of the author's long teaching and testing experience in the United Kingdom and overseas, has been written to help teachers gain some ideas of the methods of construction and analysis of objective tests. In addition to detailed discussions of the techniques of setting and marking, and specimen examples in various subjects, it provides an impartial and reasoned survey of the limitations as well as the advantages of this method of examination.


